Coding instead of encryption

Dr. Yaseen Hikmat Ismaiel 
Lecturer

Department of Computer Science
College of Computer and Mathematical Sciences

Mosul University 
Yasino79@vahoo.com 


ABSTRACT 

Due to the increased use of the internet in recent decades and the large number of
transactions and data exchanged over the network, there is an urgent need to provide
security for those data, especially those related to military, commercial, or financial
exchanges. Many cryptographic methods have recently been investigated and
developed because there is a demand for stronger encryption and decryption that is
very hard to crack. Most modern encryption methods include many substitution and
iteration processes, which suffer from problems such as a lack of robustness and
significant processing time.

In this research, a method based on the idea of coding is proposed to
achieve security instead of encryption. This research uses ASCII to build a coding table in a
different way to provide security while saving effort, time, and cost.

Keywords: coding theory, code book, encryption.

1. Introduction : 

It is well known that cryptographic systems use a key and an algorithm to convert
plain text into unintelligible (cipher) text, while encoding methods convert plain
text to another format using a particular method or a special coding table. The purpose of
coding systems may not be to achieve security but to convert data to another format so
that it can be used in a particular area. Examples of coding methods include
ASCII, Unicode, URL encoding, and Base64 [1] [2] [3] [4].

This paragraph reviews the studies and research in this field. In 1978, McEliece [5]
used the fact that a fast decoding algorithm exists for general Goppa codes to
construct a public-key cryptosystem based on algebraic coding
theory. In 2005, Grangetto M. et al. [6] introduced a randomized arithmetic coding
paradigm, which achieves encryption by inserting some randomization in the arithmetic
coding procedure; the proposed approach allows very flexible protection procedures at
the code-block level, supporting total and selective encryption as well as
conditional access. In 2010, Wong K. et al. [7] proposed a simultaneous compression
and encryption scheme in which the chaotic map model for arithmetic coding is
determined by a secret key and keeps changing. In 2010, Lu R. et al. [8] proposed an
efficient IND-CCA2-secure public key encryption scheme based on coding theory, and
then measured the efficiency of the method by comparing with the syndrome decoding
problem in the random oracle model. In 2012, Gupta V. et al. [9] proposed an improved
block cipher symmetric encryption algorithm that has the same structure for encryption and
decryption by inserting a symmetric layer; the method gave high efficiency in terms of speed.
In 2013, Al-Hazaimeh O. [10] presented a new approach based on parallel programming to
add complexity to the encryption and decryption process; it offers a high level of security with
good speed.

2. Coding and Encryption System : 

Coding systems originated long ago and require a mathematical background in order to
build a good coding theory. A coding system relies on a dictionary (table)
without the need for a complex key or algorithm; its purpose is to convert data to another
format so that it can be used in many applications. [3] [11]

The objectives of coding systems differ: some applications are designed to
achieve confidentiality, some to compress data, and some to convert data into a specific format that
can be used in specific applications. Coding systems are divided into two types, one-part and two-part
codes. In the one-part type, the code groups and the vocabulary are arranged in parallel alphabetic
sequences, so that a single book serves for encoding as well as for decoding. In the two-part type,
the encoding book lists the elements of the vocabulary in alphabetic order but the code groups are
in random order, so that a decoding book, in which the code groups appear in alphabetic (or
numerical) order accompanied by their meaning, is essential. The degree of secrecy afforded by a
code of the latter type is much greater than that afforded by one of the former type, all other
things being equal. [3] [4] [11]

Encryption is the process of encoding a message or information in such a way that only
authorized parties can access it and those who are not authorized cannot. Encryption is one of the
most important methods for providing data privacy (confidentiality), authenticity (the data came from
where it claims), and data integrity (the data has not been modified on the way) in the digital world,
especially for end-to-end protection of data transmitted across networks. There are many types of
encryption systems, and several criteria are used to classify them: by the nature of their operation,
into transposition and substitution encryption systems, or by how the key is used,
into secret-key and public-key encryption systems. [1] [2] [12] [13]

3. Proposed Method : 

The proposed method uses the concept of coding as an alternative
to encryption methods in terms of achieving a high level of security and cost savings. It is
known that coding methods used to achieve confidentiality depend on building
the coding book, which serves as the key to the coding. These methods
take the plain text word by word, look each word up in the coding book, and output
the corresponding code word, thus forming the encoded text. The proposed method uses the
encoding process in a different way, based on ASCII encoding (ASCII printable
characters), where the coding table (Table 1) was constructed using the following
rules:

- Two codes are assigned to each of the most frequently used characters (E, A, I, N, O, S,
and T) so that the resulting symbols occur with similar frequencies, which thwarts
frequency analysis.

- Codes are assigned to commonly used pronouns (I, me, my, you, your, he, him,
she, her, it, its, we, us, our, they, them, their).

- Codes are assigned to question words (who, what, where, when, why, how, and which).

- Codes are assigned to the digits (0-9).

- Two codes are assigned to the space, also to thwart frequency analysis.

- Codes are assigned to commonly used suffixes and prefixes (able, fully, sion,
tion, er, ing, ed, pre, un, re).




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


- Codes are assigned to the weekdays (Sunday, Monday, Tuesday, Wednesday, Thursday, Friday,
and Saturday).

- Codes are assigned to commonly used prepositions (in, on, and at).

- Codes are assigned to commonly used helping verbs (is, are, has, have, do, does).

3.1. The Proposed Method Algorithm : 

The algorithm of the proposed method can be summarized in the following steps:

1. Read the text to be encoded.

2. Count the number of words in the clear text based on the spaces.

3. Search for pronouns, numbers, spaces, weekdays, prepositions, and helping verbs in
the clear text and replace them with the corresponding symbols from Table 1 (note
that the symbols allocated to the space are used in a sequential manner).

4. Encode the suffixes and prefixes wherever they occur in the remaining words, and then
encode the characters of those remaining words, using the two codes assigned to each of the
most frequent characters (E, A, I, N, O, S, and T) in
alternation.

5. After the encoding process is finished, the number of words
in the clear text (step 2) is encoded and appended to the end of the resulting coded text.
This number will be used by the recipient to verify the integrity of the received text.

The mechanism of the proposed algorithm is illustrated in the flowchart in
Figure 1.

At the receiver, the decoding algorithm that recovers the clear text can be described
by the following steps:

1. Take the codes at the end of the encoded text and decode them to extract the number
of clear-text words, and keep it.

2. Use Table 1 to decode all words, word parts, and characters to obtain the clear text.

3. Count the number of words in the resulting clear text and compare it with the number
obtained in step 1. If the values are equal, the received text is correct and
the decoding process was done correctly. The decoding algorithm is illustrated in the flowchart in
Figure 2.
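The encoding and decoding steps above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the coding table of the paper: the symbol assignments are generated automatically from the printable ASCII range, the vocabulary is a small hypothetical subset, prefixes are omitted, input is limited to lowercase letters, digits and single spaces, and the word count is assumed to occupy a fixed-width trailer of four symbols (the paper does not say how the receiver locates it).

```python
import string
from itertools import cycle

# Hypothetical vocabulary; the paper's Table 1 assigns its own codes.
WORDS = ["i", "me", "my", "you", "he", "she", "it", "we", "they",
         "is", "are", "in", "on", "at"]
SUFFIXES = ["ing", "tion", "ed"]
FREQUENT = list("eainost") + [" "]           # units that get two codes each
SINGLE = [c for c in string.ascii_lowercase + string.digits
          if c not in FREQUENT]              # units that get one code each

# Hand out distinct printable ASCII symbols (33..126) in order.
_pool = (chr(c) for c in range(33, 127))
WORD_CODES = {w: next(_pool) for w in WORDS}
SUFFIX_CODES = {s: next(_pool) for s in SUFFIXES}
DOUBLE_CODES = {u: (next(_pool), next(_pool)) for u in FREQUENT}
CHAR_CODES = {c: next(_pool) for c in SINGLE}

# Reverse table used by the receiver.
DECODE = {sym: unit for table in (WORD_CODES, SUFFIX_CODES, CHAR_CODES)
          for unit, sym in table.items()}
for unit, symbols in DOUBLE_CODES.items():
    for sym in symbols:
        DECODE[sym] = unit

COUNT_WIDTH = 4  # assumed fixed-width trailer holding the word count


def count_words(text):
    """Count words by splitting on spaces (step 2 of the algorithm)."""
    return len([w for w in text.split(" ") if w])


def encode(text):
    text = text.lower()
    togglers = {u: cycle(codes) for u, codes in DOUBLE_CODES.items()}

    def char_code(ch):
        # Frequent units alternate between their two codes.
        return next(togglers[ch]) if ch in togglers else CHAR_CODES[ch]

    out = []
    for i, word in enumerate(text.split(" ")):
        if i > 0:
            out.append(char_code(" "))
        if word in WORD_CODES:               # whole-word code (step 3)
            out.append(WORD_CODES[word])
            continue
        suffix = next((s for s in SUFFIXES
                       if word.endswith(s) and len(word) > len(s)), "")
        stem = word[:len(word) - len(suffix)]
        out.extend(char_code(ch) for ch in stem)   # characters (step 4)
        if suffix:
            out.append(SUFFIX_CODES[suffix])
    trailer = f"{count_words(text):0{COUNT_WIDTH}d}"
    out.extend(CHAR_CODES[d] for d in trailer)     # word count (step 5)
    return "".join(out)


def decode(coded):
    body, trailer = coded[:-COUNT_WIDTH], coded[-COUNT_WIDTH:]
    expected = int("".join(DECODE[s] for s in trailer))
    text = "".join(DECODE[s] for s in body)
    if count_words(text) != expected:
        raise ValueError("word count mismatch: text was altered")
    return text
```

For example, `decode(encode("she is reading at 9"))` returns the original sentence, and two consecutive occurrences of a frequent letter such as "e" are emitted as two different symbols. Note that the word count only detects changes that alter the number of words; it is not a cryptographic integrity check.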
Table (1) The coding table used in the suggested method

ASCII  Character  Plain    ASCII  Character  Plain    ASCII  Character  Plain
32     (space)    A        64     @          Z        96     `          2
33     !          A        65     A          I        97     a          1
34     "          B        66     B          ME       98     b          0
35     #          C        67     C          MY       99     c          SPACE
36

Figure 1: Encoding flowchart of the proposed algorithm

Figure 2: Decoding flowchart of the proposed algorithm
4. Conclusion and further work :

- The coding book is used in a different way: some words and word
parts (commonly used in the English language) are encoded directly, and
words that are not in the coding book are encoded by first encoding their suffixes and
prefixes and then the rest of the word's characters.

- Counting the number of words in the clear text based on the spaces and
appending the encoded number to the end of the coded text provides an
innovative method for verifying integrity at the receiving party.

- Two codes were assigned to each of the most frequently used characters (E, A, I, N, O, S, and
T) and to the space so that the resulting symbols occur with similar frequencies, which
thwarts frequency analysis.

- The proposed method provides confidentiality and saves effort, time and cost.

- The coding book built in the proposed method is relatively
small and provides a high level of confidentiality for the resulting coded text.

- The resulting encoded text is significantly shorter
than the plain text, especially when the text is relatively long, so the
proposed method provides a good compression ratio in addition to confidentiality.

- In the future, intelligent techniques could be used to construct the code
book, or to convert the clear text into another form with
the same meaning so that the largest possible number of words have symbols in the code book or
coding table.
Abstract - With the development of information security, traditional image encryption methods have become
outdated. Because images are widely used in transmission, it is important to protect confidential image
data from unauthorized access. This paper presents a new chaos-based image encryption algorithm, which improves
security during transmission by exploiting properties of chaotic systems such as pseudo-random
appearance and sensitivity to initial conditions. Based on chaos theory and the decomposition and recombination of pixel
values, this new image scrambling algorithm scrambles both pixel positions and pixel values
simultaneously. Experimental results show that the new algorithm effectively improves image security
against unscrambling, and that it can restore the image to match the original, which serves the purpose of
safe and reliable image transmission.

Keywords: Color image, chaotic system, decomposition, image scrambling, recombination 

I. INTRODUCTION

Recently, the security of multimedia data has been receiving more and more attention due to its transmission over
various communication networks. In order to protect personal information, many image encryption algorithms
have been designed and proposed, such as the two-dimensional cellular automata based method [2], the Henon chaotic map
[10, 13], Chen's hyperchaotic system [12], and the Arnold transform [3, 4]. Chaotic functions have properties
such as sensitivity to initial conditions and ergodicity, which make them very desirable for encryption [1].

Image scrambling is one of the methods for securing an image by scrambling it into a disordered one beyond
recognition, making it hard for those who obtain the image in an unauthorized manner to extract information about the
original image from the scrambled one. Further, image scrambling technology depends on data hiding
technology, which provides non-password security algorithms for information hiding. Currently, the three main
types of image scrambling are scrambling in the space domain, scrambling in the frequency domain, and
scrambling in the color or grey domain. Among the many image scrambling algorithms, those
based on chaos have attracted more and more attention since they can provide a
high level of security [5, 6, 7, 8].

This paper focuses on a new image scrambling algorithm which uses a new chaotic system. Image
scrambling using chaotic properties is a way of keeping images out of the
hands of unauthorized users. The proposed image scrambling scheme generates the permuting address codes by
sorting the chaotic sequence directly. This paper analyzes the scrambling performance of the new algorithm
statistically. The conclusion of this paper indicates that the new algorithm can provide a high level of security. The
results show good performance of the proposed algorithm, which can also be applied in real-time
applications and digital communications as it is a straightforward mechanism and easy to implement.

The rest of the paper is organized as follows: the proposed chaotic system is described in section 2, the image scrambling
algorithm based on chaos theory in section 3, and experimental details and results are analyzed in section 4. The
paper ends with a conclusion in section 5.


II. PROPOSED CHAOTIC SYSTEM 


In this section, we describe the new chaotic system used in this work. 

2.1. New Chaotic System 

Recently, Chen and Lee [9] introduced a new chaotic system, which is described by the following nonlinear
differential equations:

$$
\begin{aligned}
\dot{x}_1 &= a x_1 - x_2 x_3 \\
\dot{x}_2 &= -b x_2 + x_1 x_3 \\
\dot{x}_3 &= -c x_3 + \tfrac{1}{3} x_1 x_2
\end{aligned}
\qquad (1)
$$

$$ W = C x $$

Where:

- x_1, x_2 and x_3 are the state variables and a, b and c are positive constants.

- C = (1 0 0).

- W is the system measured output.

When a = 5.5, b = 11 and c = 4, the system (1) is chaotic.

2.2. Lyapunov exponent 

By linearizing the system around the equilibrium point E, with Jacobian matrix J_E, and solving the following
characteristic equation:

$$ |\lambda I - J_E| = 0 \qquad (2) $$

the new chaotic system (1) is found to have three eigenvalues, shown in figure 1:

$$ \lambda_1 = 5.500, \quad \lambda_2 = -10.994, \quad \lambda_3 = -3.999 $$

Figure 1: Lyapunov exponent of new chaotic system (plot titled "Dynamics of Lyapunov exponents", horizontal axis: Time)

2.3. Sensitivity to initial conditions 

Sensitivity to initial conditions means that each point in a chaotic system is arbitrarily closely approximated
by other points with significantly different future paths, or trajectories. Thus, an arbitrarily small change, or
perturbation, of the current trajectory may lead to significantly different future behavior. The next figure
compares the time series for two slightly different initial conditions. The two time series stay close together for
about 2 iterations, but after that they evolve independently of each other.
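The divergence of two nearby trajectories can be explored numerically. The sketch below is illustrative only and rests on assumptions: the right-hand side is taken to be the Chen-Lee form with the third equation written as -c x_3 + x_1 x_2 / 3 (the sign of the c term and the 1/3 coefficient are assumed here), and a fixed-step fourth-order Runge-Kutta integrator with a hypothetical step size of 0.001 stands in for whatever solver the authors used.

```python
import math


def chen_lee(state, a=5.5, b=11.0, c=4.0):
    """Assumed right-hand side of system (1); a, b, c are positive."""
    x1, x2, x3 = state
    return (a * x1 - x2 * x3,
            -b * x2 + x1 * x3,
            -c * x3 + x1 * x2 / 3.0)


def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(k, h):
        return tuple(s + h * ki for s, ki in zip(state, k))

    k1 = f(state)
    k2 = f(shifted(k1, dt / 2))
    k3 = f(shifted(k2, dt / 2))
    k4 = f(shifted(k3, dt))
    return tuple(s + dt / 6 * (p + 2 * q + 2 * r + w)
                 for s, p, q, r, w in zip(state, k1, k2, k3, k4))


def trajectory(state, dt=0.001, steps=100):
    """Return the list of states visited, including the initial one."""
    points = [tuple(state)]
    for _ in range(steps):
        points.append(rk4_step(chen_lee, points[-1], dt))
    return points


def separation(p, q):
    """Euclidean distance between two states."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(p, q)))


if __name__ == "__main__":
    # The two initial conditions quoted in the caption of Figure 2.
    path_a = trajectory((-2.0, 2.0, 1.0), steps=5000)
    path_b = trajectory((-1.9, 1.9, 0.9), steps=5000)
    for n in range(0, 5001, 1000):
        print(n, separation(path_a[n], path_b[n]))
```

Printing the separation at regular intervals gives a rough picture of how quickly the two time series drift apart; the exact numbers depend on the assumed equation form, step size and integrator.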



Figure 2: Sensitivity to two initial conditions [-2, 2, 1] and [-1.9, 1.9, 0.9]
(a): x_1 (b): x_2 (c): x_3


III. IMAGE SCRAMBLING ALGORITHM BASED ON CHAOS THEORY 
3.1. Proposed algorithm 

The proposed chaotic system is now used in the design of a color image encryption algorithm. The input of the proposed
image encryption algorithm is an original image and the output is a scrambled one. Figure 3 illustrates
the scheme of the proposed algorithm.


[Block diagram: RGB three-color separation block, chaotic generator (key), RGB three-color combination block, scrambled color image]

Figure 3: Principle of chaotic scrambling algorithm for color image
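The idea of generating permuting address codes by sorting a chaotic sequence, applied separately to the three color channels, can be sketched as follows. This is a simplified illustration and not the authors' implementation: a logistic map is swapped in for the paper's chaotic system to keep the example short, the three initial values act as a hypothetical key, and the image is represented as a flat list of (r, g, b) tuples.

```python
def logistic_sequence(x0, n, r=3.99, burn_in=100):
    """Stand-in chaotic generator (logistic map), NOT the paper's system."""
    x = x0
    for _ in range(burn_in):          # discard the transient
        x = r * x * (1.0 - x)
    seq = []
    for _ in range(n):
        x = r * x * (1.0 - x)
        seq.append(x)
    return seq


def permutation_from_key(x0, n):
    """Address codes: indices that would sort the chaotic sequence."""
    seq = logistic_sequence(x0, n)
    return sorted(range(n), key=seq.__getitem__)


def scramble(values, perm):
    return [values[i] for i in perm]


def unscramble(scrambled, perm):
    restored = [None] * len(perm)
    for position, source in enumerate(perm):
        restored[source] = scrambled[position]
    return restored


def scramble_rgb(image, keys):
    """Separate the channels, permute each with its own key, recombine."""
    channels = list(zip(*image))
    mixed = [scramble(ch, permutation_from_key(k, len(image)))
             for ch, k in zip(channels, keys)]
    return list(zip(*mixed))


def unscramble_rgb(scrambled, keys):
    channels = list(zip(*scrambled))
    restored = [unscramble(ch, permutation_from_key(k, len(scrambled)))
                for ch, k in zip(channels, keys)]
    return list(zip(*restored))
```

Because each channel is permuted independently, recombining the channels changes the color of individual pixels as well as their positions, while a receiver holding the same three initial values can regenerate the permutations and invert them exactly.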
IV. EXPERIMENTAL DETAILS AND RESULTS

A good scrambling algorithm should be robust against all types of attack. Some experiments are presented
in this section to demonstrate the efficiency of the proposed technique. The proposed technique is
applied to two color images, "Gallery" and "Alice", each of resolution 256*256. We analyze the results by
calculating histograms and correlation coefficients to test the performance of the proposed technique. The next
figures show the results of the scrambling algorithm.

Figure 4: "Alice" image at different steps of the scrambling process

Figure 5: "Gallery" image at different steps of the scrambling process


4.1. Statistical analysis 

In order to resist attacks, the scrambled images should possess certain random properties. To demonstrate the
robustness of the proposed algorithm, a statistical analysis has been performed by calculating the histograms and
the correlation coefficients for the original image and the scrambled image. For the two images that have been
tested, the scrambling quality was found to be good.

4.1.1. Histogram Analysis

An image histogram is a commonly used method of analysis in image processing. The advantage of a 
histogram is that it shows the shape of the distribution for a large set of data. Thus, an image histogram 
illustrates how pixels in an image are distributed by plotting the number of pixels at each color intensity level. It 
is important to ensure that the encrypted and original images do not have any statistical similarities. 

The experimental results of the original image and its corresponding scrambled image and their histograms 
are shown in Fig. 6. The histogram of each original image illustrates how the pixels are distributed by graphing 
the number of pixels at every color of RGB [14]. It is clear that the histogram of the scrambled image is 
different from the respective histograms of the original image. 
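A per-channel histogram of the kind discussed here can be computed with a short routine. The sketch below is a minimal illustration, assuming the image is given as a flat list of (r, g, b) tuples with 8-bit values; the `histogram_distance` helper is a hypothetical convenience for comparing two histograms and is not a measure used in the paper.

```python
def channel_histograms(image):
    """Return three 256-bin histograms, one per RGB channel."""
    hists = [[0] * 256 for _ in range(3)]
    for pixel in image:
        for channel, value in enumerate(pixel):
            hists[channel][value] += 1
    return hists


def histogram_distance(h1, h2):
    """Sum of absolute bin differences; 0 when the histograms match."""
    return sum(abs(a - b) for a, b in zip(h1, h2))
```

One point worth keeping in mind when reading such plots: a permutation of positions within a single channel leaves that channel's histogram unchanged, so a visible difference between the original and scrambled histograms indicates that pixel values, and not only positions, were modified.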



Figure 6: "Alice" image histogram in three channels RGB
(a): Original (b): Scrambled

Figure 7: "Gallery" image histogram in three channels RGB
(a): Original (b): Scrambled

4.2. Correlation of two adjacent pixels 

In addition to the histogram analysis, we have also analyzed the correlation between two vertically adjacent
pixels, two horizontally adjacent pixels and two diagonally adjacent pixels in the plain image and the cipher image
respectively.

Correlation is a statistical measure that expresses the degree of relationship, or association, between two adjacent
pixels in an image. The aim of correlation
measures is to keep the amount of redundant information available in the scrambled image as low as possible
[11, 15].

Equation (3) is used to study the correlation between two adjacent pixels in the horizontal, vertical, diagonal
and anti-diagonal orientations:

$$
c = \frac{N \sum_{j=1}^{N} x_j y_j - \sum_{j=1}^{N} x_j \sum_{j=1}^{N} y_j}
{\sqrt{\left(N \sum_{j=1}^{N} x_j^2 - \left(\sum_{j=1}^{N} x_j\right)^2\right)\left(N \sum_{j=1}^{N} y_j^2 - \left(\sum_{j=1}^{N} y_j\right)^2\right)}}
\qquad (3)
$$

where x and y are the intensity values of two adjacent pixels in the image and N is the number of adjacent
pixel pairs selected from the image to calculate the correlation. Results for the correlation coefficients of two
adjacent pixels are shown in Tables I and II.

In the experimental results, 3000 pairs of adjacent pixels are randomly selected. Fig. 8 shows the
distribution of two adjacent pixels in the original image and the scrambled image. There is a strong correlation
between adjacent pixels in the original image data [16, 17], while there is only a small correlation between adjacent
pixels in the scrambled image.
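The measurement described here, the correlation coefficient of randomly sampled adjacent pixel pairs, can be sketched as below. This is an illustrative version under assumptions: the image is a single channel given as a list of rows, the coefficient is the standard sample correlation, and the sampling seed and function names are hypothetical.

```python
import math
import random

# Row and column offsets of the neighbor for each direction.
OFFSETS = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}


def correlation(xs, ys):
    """Sample correlation coefficient of paired intensity values."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    denom = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if denom == 0:
        raise ValueError("constant samples: correlation is undefined")
    return (n * sxy - sx * sy) / denom


def sample_adjacent_pairs(gray, direction, n_pairs=3000, seed=0):
    """Randomly pick n_pairs of neighboring pixels in one direction."""
    dr, dc = OFFSETS[direction]
    rows, cols = len(gray), len(gray[0])
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n_pairs):
        r = rng.randrange(rows - dr)
        c = rng.randrange(cols - dc)
        xs.append(gray[r][c])
        ys.append(gray[r + dr][c + dc])
    return xs, ys
```

Calling `correlation(*sample_adjacent_pairs(channel, "horizontal"))` on each color channel of the original and the scrambled image gives values comparable in kind to those reported in the tables; a smooth image yields values near 1, and an effective scrambling should bring them closer to 0.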



Figure 8: Horizontal, vertical and diagonal correlation of original and scrambled "Alice" image
(a): Original image (b): Scrambled image

Figure 9: Horizontal, vertical and diagonal correlation of original and scrambled "Gallery" image
(a): Original image (b): Scrambled image


Table I

Correlation coefficient corresponding to original and scrambled images

Image     Version     Horizontal   Vertical   Diagonal
Alice     Original    0.9759       0.9855     0.9677
Alice     Scrambled   0.5632       0.5656     0.5612
Gallery   Original    0.9878       0.9777     0.9679
Gallery   Scrambled   0.3977       0.3921     0.3832

Table II

Correlation coefficient corresponding to original and recovered images

Image     Version     Horizontal   Vertical   Diagonal
Alice     Original    0.9759       0.9855     0.9677
Alice     Recovered   0.9759       0.9855     0.9677
Gallery   Original    0.9878       0.9777     0.9679
Gallery   Recovered   0.9674       0.9665     0.9290


4.3. PSNR 

The Peak Signal to Noise Ratio (PSNR) criterion is used to assess how perceptible the difference between two images is. This measure indicates
the degree of similarity between the original image and the recovered image. PSNR is expressed
mathematically in the following form:

$$ \mathrm{PSNR}\,[\mathrm{dB}] = 10 \log_{10}\left(\frac{255^2}{\mathrm{EQM}(I_O, I_R)}\right) $$

where EQM is the mean square error between the two images (I_O original, I_R recovered):

$$ \mathrm{EQM}(I_O, I_R) = \frac{1}{mn} \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} \left(I_O(x,y) - I_R(x,y)\right)^2 \qquad (4) $$
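The two formulas translate directly into code. The sketch below is a minimal illustration for a single channel stored as a list of rows with 8-bit values; returning infinity for identical images is an assumption about convention, chosen because it matches the "Inf" value reported for an exact recovery.

```python
import math


def mse(original, recovered):
    """Mean square error (EQM) between two equally sized images."""
    m, n = len(original), len(original[0])
    total = sum((original[x][y] - recovered[x][y]) ** 2
                for x in range(m) for y in range(n))
    return total / (m * n)


def psnr(original, recovered, peak=255):
    """Peak signal to noise ratio in dB; infinite for identical images."""
    error = mse(original, recovered)
    if error == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / error)
```

For a color image, one common choice is to average the mean square error over the three channels before taking the logarithm; the paper does not state which variant was used.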


To recover the two images, we apply the inverse of the proposed algorithm shown in figure 3. The result is shown in
figure 10.

A high PSNR means that the mean square error between the original image and the reconstructed image is very low, which
implies that the image has been properly restored. In other words, the higher the PSNR, the better the quality of the restored image. In our case,
the values of PSNR are as follows:


PSNR (Alice) = Inf
PSNR (Gallery) = 51.75

The result is in close agreement with the correlation coefficients.

For "Alice", the correlation coefficients of the original and recovered images are identical. The PSNR is equal to
Inf, which means that the recovered image is identical to the original image.

For "Gallery", the correlation coefficient for the recovered image is at 40% of the original image, which justifies
the corresponding PSNR value.


V. CONCLUSION 

In this paper, a new image scrambling algorithm is presented, which uses image scrambling to encrypt the image and improve
its security. The new algorithm, based on a chaotic system and the decomposition and recombination of
pixel values, is able to scramble both the pixel positions and the pixel values of images. Analysis of the statistical information
of scrambled images in the experimental tests shows that the algorithm provides reasonable security,
owing to the strong irregularity of the sorting transformation, which improves the effect of the scrambling. The
experimental results show that the algorithm scrambles the image effectively and can provide high security. The
scrambling was simulated under Matlab 7 to confirm this.
Abstract

With the growth of cloud technologies, computing
resources and cloud storage have become the most
in-demand online services. Several companies
wish to outsource their data storage and resources as
well. When storing private and sensitive data in a
third-party data center, security and privacy become
major issues that must be considered. In this paper, a novel
Double Encryption with Single Decryption (DESD) crypto
technique is proposed to secure data in cloud storage.
The proposed technique comprises encryption and
decryption phases. In the encryption phase, the data is
randomly partitioned into multiple fragments. Double
encryption is applied to each fragment using prime numbers
as well as an Invertible Non-linear Function (INF). The
resulting encrypted fragments are stored at multiple cloud
storages with the help of the cloud service provider (CSP).
After the verification process, the data user collects the key
from the data owner and decrypts the data gathered from
the cloud using knowledge of the inverse INF. The proposed
crypto technique provides more security and privacy for
cloud data, and illegitimate users cannot retrieve the
original data. The performance of the proposed DESD
technique is compared with the AES and Triple DES
techniques, and the plotted experimental results
show that the proposed technique is efficient and faster.

Key words: Cloud computing, cloud service
provider, DESD crypto technique, Invertible
Non-linear Function, AES and Triple DES.

1. Introduction 

In recent years, fast-growing innovative
technology has offered users several paperless
services that are available online, for example
e-banking, e-billing, e-mail, e-shopping and
e-transactions. These paperless services require data
exchange online. This data might be any
personal or sensitive information such as credit or
debit card details, business secrets,
banking transactions and so on. This kind of
information needs more security, as disclosure of such
personal data to any illegitimate user can have
extremely hazardous consequences. There is a strong
need to protect users while they exchange their
personal information through untrusted networks.
Thus, it is necessary to develop a security mechanism
for converting a user's personal or sensitive
information to some other unreadable format. When
sending such information, it is essential to make it
harder for intruders to extract anything from what
they observe. Cryptography is one of the techniques to
achieve this.

In cloud computing, the data of the user (i.e. the data owner) is
stored at some untrusted third party and needs
extreme protection, as the data owner does not have
any physical access to the information. Data privacy
and security of the user or owner are consistently a vital
issue in cloud computing (Dai Yuefa et al (2009),
Mohit Marwahe and Rajeev Bedi (2013)). The cloud
provides several advantages, such as low cost and easy
access to data, but privacy and
security problems are of concern when storing a user's
personal and sensitive data in cloud storage (M.
Mohamed et al (2013)). Data in cloud storage might
be attacked in two ways, from inside or from outside
(L. Arockiam and S. Monikandan (2013)). If
an attacker without authorization attempts to access the
cloud data while it is in transit or at rest, this is
known as an outside attack. An attack from the cloud
administrator side is defined as an inside attack.
Compared to the outside attack, the inside attack is
really hard to identify, and the data owner or user
must be very careful while storing and retrieving
their personal data to or from the cloud storage.
Moreover, the data retrieved by the authorized user
from the cloud should not be in its actual format, as there
is a high possibility of outside attack. Therefore all the
data must be converted into an unreadable format by
encryption before it is stored in the cloud; its actual
format is then restored by decryption. This is
possible with the aid of cryptography.

Cryptography is classified into two techniques,
namely "code making" and "code breaking".
Code making involves converting a message or data
into another incomprehensible/unreadable format to
secure it from any malicious activity of malicious
users, whereas code breaking is
known as cryptanalysis (Chris Christensen (2006)).

The major objective of cryptography is to prevent
intruders from obtaining the actual data and to permit
only legitimate users to obtain the correct information
without any modification. The use of
cryptographic strategies guarantees that the user's
personal information remains secure from any
changes and from illegitimate users. Illegitimate
users cannot break the encrypted code of the original
information, while only legitimate users have the
authority to convert the translated information
back into its actual format (Sinkov A (1996)). The
processes of converting the original data and of
recovering the exact data are called encryption and
decryption respectively.

This paper proposes a novel Double Encryption with
Single Decryption (DESD) crypto technique to
protect cloud data. Data of large volume is split into
a number of small fragments by a data partitioning
process. Each partition is then encrypted twice.
The first encryption is accomplished with prime numbers.
The prime numbers are generated randomly,
and the number of generated primes equals
twice the number of data partitions when the data
owner wants to produce four encrypted forms. Depending
on the data owner's preference, 8, 16 or 32
encrypted forms can be produced for a single data part. Then
the complements of all primes are computed, so each
partition is encrypted with a prime and its
complement. After the first encryption, a large integer
is generated and divided into a number of small
integers equal to the number of data
partitions. Each small integer is added to one
encrypted partition. Each resulting cipher is then
subjected to another encryption using an invertible
non-linear function (INF), which has two random integers.
The second encryption is achieved by multiplying
each data partition by the first integer and adding
the second integer. At the user end, a single
decryption is enough to decrypt the data, and the key,
which is a large integer value, is subtracted to
retrieve the original data. Importantly, the
data user must know the inverse of the
invertible non-linear function for decryption.
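The second encryption stage and its inverse, as described in this paragraph (multiply each partition by a first integer, add a second integer), can be sketched as follows. This is only a partial, illustrative sketch under assumptions: it covers random partitioning and the multiply-and-add step, it leaves out the prime-number stage and the splitting of the large key integer because the text does not specify them in enough detail, and the function names, fragment-length bookkeeping and example integers are all hypothetical.

```python
import random


def partition(data, parts, rng):
    """Split bytes into `parts` non-empty fragments at random cut points."""
    cuts = sorted(rng.sample(range(1, len(data)), parts - 1))
    bounds = [0] + cuts + [len(data)]
    return [data[i:j] for i, j in zip(bounds, bounds[1:])]


def inf_encrypt(value, mult, shift):
    """Multiply by the first integer and add the second one."""
    return mult * value + shift


def inf_decrypt(cipher, mult, shift):
    """Inverse of inf_encrypt; rejects values that were never produced."""
    value, remainder = divmod(cipher - shift, mult)
    if remainder:
        raise ValueError("cipher is not in the image of the function")
    return value


def encrypt_fragments(data, parts, mult, shift, seed=0):
    """Return (length, cipher) records, one per fragment."""
    rng = random.Random(seed)
    records = []
    for fragment in partition(data, parts, rng):
        number = int.from_bytes(fragment, "big")
        records.append((len(fragment), inf_encrypt(number, mult, shift)))
    return records


def decrypt_fragments(records, mult, shift):
    """Invert every record and join the fragments in order."""
    return b"".join(inf_decrypt(cipher, mult, shift).to_bytes(length, "big")
                    for length, cipher in records)
```

For example, `decrypt_fragments(encrypt_fragments(b"sensitive cloud data", 4, 7919, 104729), 7919, 104729)` returns the original bytes. The step as described is an affine map, so in this sketch its secrecy rests entirely on the two integers being unknown to the attacker.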

The rest of the paper is organized as follows: Section 
2 presents the related works on cryptographic 
techniques and section 3 presents the problem 
definition. In section 4, the proposed method is 
presented in detail. Section 5 deals with the 
experimental results and in section 6 the paper is 
concluded with scope for future work. 

2. Related work 

V. Masthanamma and G. Lakshmi Preya (2015) examine
the use of cryptographic schemes to enhance
the security of encrypted data that is sent by cloud
users to the cloud server. The fundamental goal is to
perform encryption and decryption of data in a
secure way with very little time and
low cost for both the encoding and decoding processes.
Various numbers of keys are produced and repeated
attacks are observed. Repeating this strategy
helps the data remain safe against attacks and
extends the security of the data that is sent by
cloud users to the cloud server.

H.Y. Lin and W.G. Tzeng (2012) presented a
threshold proxy re-encryption scheme in which data
security is accomplished using a decentralized erasure
code. This makes the system more robust and solves the
privacy issues of the cloud service provider (CSP).
Here the data is stored in a cloud storage server in
encrypted format, and when a user requests the data,
the data holder sends the re-encryption key to the
server, which re-encrypts the same data for the requesting
user. The authors consider that the cloud storage
comprises storage servers and key servers, where the
storage servers perform the data storing operations.
In order to decrypt the encoded and
encrypted data with n codeword symbols, each key
server has to independently perform partial
decryption.

Pancholi et al. (2016) have presented the view that the use of
different components by the cipher and its inverse practically
eliminates weak keys in AES, which are a drawback of DES. In
AES, the likelihood of equivalent keys is practically removed
by the nonlinearity of the key expansion. For several
microcontrollers, an implementation comparison among AES, DES
and Triple DES shows that AES and Triple DES have a
computational cost of the same order. Another performance
evaluation reveals that AES has an advantage over the
algorithms 3DES, DES and RC2 in terms of execution time, for
different packet sizes, and throughput for both encryption and
decryption. Similarly, when the data type is changed, for
instance to image instead of text, AES has been found to have
an advantage over Blowfish, RC2 and RC6 in terms of time
consumption.

K. Nasrin et al. (2014) dealt with the cloud storage framework,
one of the most important research areas in cloud computing, in
which security is considered one of the vital concerns. The
authors combined the asymmetric and symmetric key approaches,
utilizing the AES and RSA algorithms, and derived a novel
mechanism. AES is useful for key sharing and is a low-overhead
cryptographic technique, while RSA adds complexity to provide
security against attackers. The main focus of the authors was
on demonstrating secure data communication over defenseless or
vulnerable networks.

Jayant, D. et al. (2015) presented a novel mechanism called
role-based access control (RBAC), applying the AES and RSA
algorithms to provide a secure communication environment for
open cloud environments. The authors used the RSA and AES
algorithms for encryption and decryption, while access control
is achieved using the RBAC mechanism. According to the RBAC
model, uploading rights and various other rights were given to
different users.

In this paper, a novel DESD crypto technique is proposed to
provide privacy and security for confidential and sensitive
data stored on a cloud server. It requires less computation
time at low cost. It also provides better protection against
intruders and malicious activities with faster operations.

3. Objective and issues 

The main objective of this paper is to design an efficient
cryptographic technique that is simple and takes less time to
perform encryption and decryption operations on data stored in
the cloud. The encrypted data should require limited storage
space. The following privacy and security issues are addressed:

> Access control: The CSP may fail in some situations in the
cloud environment, which opens the way for intruders and
malicious activities.

> Lack of user control: In the cloud, user data is stored at a
remote location and is completely controlled by the CSP, i.e.
the user has no control over his or her data.

> Control policy: The CSP may have a self-interest in the
user's data under some network conditions. It is therefore
necessary to implement a security mechanism that imposes a
control policy on the CSP in the cloud environment.

4. Proposed Methodology 
4.1. System Model 

The proposed system model comprises four entities: the data
owner, the cloud service provider (CSP), the cloud storage and
the data user. Figure 1 illustrates the system model of the
proposed work, and its flow of operation is explained below.

1. Send encrypted data
2. Store encrypted data in cloud
3. Send data access request
4. Provide key and certificate
5. Send access request and certificate
6. Command to provide encrypted data


Figure 1: Proposed system model 


The data owner collects data in any form, such as image, text,
audio or video, and builds a document index, which is
partitioned into several small fragments. Each fragment is
encrypted multiple times and outsourced to the cloud storage.
The cloud service provider (CSP) is responsible for allocating
available space for the outsourced data at different storage
locations of a single cloud or of different clouds. The CSP has
complete control of the cloud storage, i.e. once the data is
stored in the cloud it is completely controlled by the CSP.
Here 'N' clouds are used to store the user's data. In cloud
storage, the ciphers are stored in the allocated storage space.
A data user who wants to access data in the cloud must first be
verified by the CSP as authorized. An authorized user is then
allowed to send a data access request to the data owner. The
data owner responds to the request by sending an authentication
certificate with a decryption key. After verifying the
authentication certificate, the CSP commands the storage to
provide the data. Finally, decryption is done with the key
obtained from the data owner.

4.2. Detailed contribution 

The detailed contribution of the proposed work is explained
through the block diagram shown in Figure 2. This block diagram
comprises three major blocks: data owner, CSP and data user.
The operation of each block is explained in detail below.

Figure 2: Block diagram (blocks: Data Owner, Cloud Service Provider, Data User)

The data owner holds the data in the form of plaintext, which
is large in size, so it is partitioned into multiple small
fragments. Data partitioning has several advantages: 1)
processing a large volume of data at once makes the operation
complex, which partitioning avoids; 2) uploading and
downloading these small fragments requires relatively little
time; 3) the fragments are very easy to access. Each
partitioned fragment is then encrypted multiple times with
prime numbers and their complements. The second encryption is
accomplished with the Invertible Non-linear Function (INF),
which produces ciphers in an unreadable format. These are
outsourced to different locations in the same cloud storage or
in different cloud storages. For that, the CSP has to process
these data to allocate available space for storage. After all
verification processes, the CSP commands the cloud to provide
the data stored at the different locations. The data user
merges all the collected encrypted data and applies the inverse
INF. Finally, to get the plaintext, the key is subtracted from
the previous results.


Symbol          Description
x_i             Plaintext
i               Number of split or partitioned data, i = 1, 2, ...
P_N             Number of generated primes
P_m             Set of prime numbers
P_i^c, Q_i^c    Prime complements
p               Bit length of the prime. Here p = 32
D_i             Random integer
d_n             Number of splits of the random integer
f_i             Generated ciphers
S_i             Sum of encrypted form


4.3. Encryption 

A large volume of data is to be stored in the cloud
effectively, so in the first step of encryption it is
partitioned into a number of small partitions or fragments. A
double encryption algorithm is proposed to encrypt each
partition. The first encryption is done with the prime numbers
and their complements. The pseudo plaintexts, or cipher texts,
are then obtained by the second encryption with the Invertible
Non-linear Function (INF), whose general form is

$g(y) = ay + b$

where a and b are integers and y denotes the cipher texts
obtained through the first encryption. Each cipher part is
multiplied by a and then b is added. These ciphers are stored
at different locations of a single cloud or of multiple clouds.
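
As a minimal sketch of the second-stage transform g and of the inverse that the data user later applies, the following Python fragment may help. The function names `inf` and `inf_inverse` and the sample values of a and b are illustrative assumptions, not taken from the paper.

```python
def inf(y: int, a: int, b: int) -> int:
    """Second encryption: g(y) = a*y + b, with a and b the two random integers."""
    if a == 0:
        raise ValueError("a must be non-zero, otherwise g cannot be inverted")
    return a * y + b


def inf_inverse(c: int, a: int, b: int) -> int:
    """Inverse transform: (c - b) / a, computed with exact integer division."""
    quotient, remainder = divmod(c - b, a)
    if remainder != 0:
        raise ValueError("c was not produced by g with these a and b")
    return quotient


# Hypothetical sample values, chosen only for illustration.
a, b = 7919, 104729
first_stage_cipher = 123456789
second_stage_cipher = inf(first_stage_cipher, a, b)
recovered = inf_inverse(second_stage_cipher, a, b)
```

Integer arithmetic is used throughout so that the round trip is exact for arbitrarily large cipher values.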


a. Double Encryption Algorithm

Input: $X_i$

Method:

i. Random partitioning
   $X_i = \{x_1, x_2, x_3, \ldots, x_n\}$

ii. Generate $P_N$ and $P_N = 2X_i$

iii. Take $P_i, Q_i$

iv. Compute $P_i^c, Q_i^c$
   [$\because K^c = 2^{p+1} - K$]

v. Generate $D$ and split it into small integers
   $D_i = d_1, d_2, \ldots, d_n$

vi. $d_n = x_n$ (Here, we take n = 4)

vii. First encrypted data
   $y_1 = (x_1 \cdot P \cdot Q) + d_1$
   $y_2 = (x_1 \cdot P \cdot Q^c) + d_2$
   $y_3 = (x_1 \cdot P^c \cdot Q) + d_3$
   $y_4 = (x_1 \cdot P^c \cdot Q^c) + d_4$

viii. Second encryption with INF
   $g(y_1) = a y_1 + b$
   $g(y_2) = a y_2 + b$
   $g(y_3) = a y_3 + b$
   $g(y_4) = a y_4 + b$

ix. Apply the above steps to each data part (up to $x_n$) and
   store the obtained multiple ciphers in different locations
   of a single cloud storage or of different cloud storages.

Figure 3: Flow diagram of encryption algorithm
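
The two encryption stages for a single fragment can be sketched in Python as follows. This is one possible reading of the algorithm, not the authors' implementation: the helper names, the sample values of P, Q, a, b and the pieces of D are all hypothetical, and the complement is taken as 2^(bits+1) - K with bits = 32.

```python
BITS = 32  # assumed bit length of the primes


def complement(k: int, bits: int = BITS) -> int:
    """Complement K^c = 2^(bits+1) - K, so that K + K^c = 2^(bits+1)."""
    return (1 << (bits + 1)) - k


def first_encrypt(x: int, p: int, q: int, pieces: list) -> list:
    """First stage: four ciphers y_1..y_4 for one plaintext fragment x."""
    if len(pieces) != 4:
        raise ValueError("this sketch uses n = 4 pieces of the random integer")
    pc, qc = complement(p), complement(q)
    factors = [p * q, p * qc, pc * q, pc * qc]
    return [x * f + d for f, d in zip(factors, pieces)]


def second_encrypt(ys: list, a: int, b: int) -> list:
    """Second stage: apply g(y) = a*y + b to every first-stage cipher."""
    return [a * y + b for y in ys]


# Hypothetical inputs for illustration only.
P, Q = 4294967291, 4294967279          # 32-bit values standing in for P and Q
x1 = int.from_bytes(b"cloud", "big")   # one plaintext fragment as an integer
d = [11, 22, 33, 44]                   # pieces d_1..d_4 of the random integer
ys = first_encrypt(x1, P, Q, d)
ciphers = second_encrypt(ys, a=7919, b=104729)
```

Under this reading, P + P^c = Q + Q^c = 2^33, so the four factors add up to 2^66. That identity is what lets the fragment be recovered later by summing the four first-stage ciphers; the arithmetic itself does not depend on P and Q being prime.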

4.4. Decryption 

In most cryptographic techniques, the decryption keys are
included with the encrypted data stored in cloud storage. In
our proposed crypto technique, however, the data owner holds
the decryption keys and gives them, together with an
authentication certificate, to authorized data users who
request them. After verifying this certificate, the CSP
commands the storage to provide the cipher text (encrypted
data). With knowledge of the inverse INF and the decryption
key, the user decrypts the encrypted data. The general form of
the inverse INF is

$g^{-1}(x) = \frac{x - b}{a}$

where a and b are integers and x is the cipher text.

a. Decryption Algorithm

Step 1: Apply the inverse form of the INF
   $G_i = g^{-1}(x) = \frac{x - b}{a}$

Step 2: Add all first encrypted ciphers
   $S_i = \sum_k f_k$

Step 3: Subtract the large integer
   $Z_i = S_i - D_i$ ($Z_i$ with padded zeros)

Step 4: Delete the padded zeros

Step 5: Perform the above steps on all four encrypted data
parts (up to $x_n$) and sum all of them
   $R_i = \sum_{i=1}^{n} x_i$

Step 6: Convert into byte array
Step 7: Merge all byte arrays
Step 8: Get the original plaintext ($X_i$)
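
The single decryption (inverse INF, summation of the first-stage ciphers, subtraction of the large integer and removal of the padded zeros) can be sketched in Python as below. It is a sketch under stated assumptions rather than the authors' code: the complement is taken as 2^33 - K, the pieces of the random integer are assumed to sum to D, the padded zeros are read as the 66 low-order binary zeros, and every name and sample value is hypothetical.

```python
BITS = 32  # assumed bit length of the primes


def decrypt_fragment(ciphers, a, b, big_d, bits=BITS):
    """Recover one plaintext fragment from its four second-stage ciphers."""
    ys = [(c - b) // a for c in ciphers]   # inverse INF on every cipher
    s = sum(ys)                            # add all first-stage ciphers
    z = s - big_d                          # subtract the large integer D
    return z >> (2 * (bits + 1))           # delete the padded (binary) zeros


# Build a hypothetical ciphertext so that the round trip can be checked.
P, Q = 4294967291, 4294967279              # 32-bit values standing in for P, Q
Pc, Qc = (1 << 33) - P, (1 << 33) - Q      # assumed complements
x = int.from_bytes(b"cloud", "big")        # plaintext fragment as an integer
pieces = [11, 22, 33, 44]                  # d_1..d_4, assumed to sum to D
D = sum(pieces)
a, b = 7919, 104729
factors = [P * Q, P * Qc, Pc * Q, Pc * Qc]
ciphers = [a * (x * f + d) + b for f, d in zip(factors, pieces)]

recovered = decrypt_fragment(ciphers, a, b, D)
plaintext = recovered.to_bytes((recovered.bit_length() + 7) // 8, "big")
```

The sum of the four first-stage ciphers equals x * 2^66 + D, so subtracting D leaves the fragment followed by 66 zero bits, which the final shift removes.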



Figure 4: Flow diagram of decryption 
algorithm 

5. Performance Analysis 

In this section, the performance of the proposed DESD crypto
technique is analyzed and compared in detail with existing
techniques. Privacy and security are the most important
concerns in cloud computing, and existing cryptographic
techniques have tried their best to provide both for cloud
storage. A great many cryptographic techniques have been
proposed previously and we cannot compare against all of them,
so we take two standard cryptographic techniques for
comparison: AES and Triple DES.

Consider a situation in which an intruder gathers data from the
cloud by breaking the protection mechanism of the CSP. We
compare the performance of the existing and proposed techniques
in this situation.

First we take the existing AES and Triple DES techniques, which
give the CSP complete control of the data. The intruder can
easily break the gathered encrypted files since the decryption
keys are stored with them. Moreover, the existing techniques
cannot give a one hundred percent privacy assurance when the
data encryption is done by the CSP. If the CSP has a
self-interest in the data, it can misuse the data without the
data owner's knowledge.

In our proposed technique, the data owner has complete control
of the data by keeping the decryption key and storing only the
encrypted file in the cloud storage. Without knowledge of the
inverse INF and the decryption key, the intruder cannot decrypt
the file and retrieve the data. The technique therefore
provides complete access control over user data. The
self-interest of the CSP in the data falls under the control
policy, which is the most significant issue in the cloud
environment. Here the only task of the CSP is to allocate
storage space for the data; it is never involved in data
partitioning or encryption, so this self-interest cannot affect
the cloud data. From this we can summarize that the proposed
DESD crypto technique is more secure and provides better
privacy to cloud users.

6. Result and Discussion 

The experiment was conducted on an Intel(R) Core(TM)2 Duo CPU
with 4 GB RAM running Windows 7, and was implemented in Java.
To demonstrate the efficiency of the proposed DESD crypto
technique, it is compared with the existing AES and Triple DES
techniques. In our implementation we used input files of the
same size for all three techniques and examined their
performance. The encryption time and decryption time are
compared against file size.


Figure 5: File size (MB) vs. Encryption Time (sec) 

In Figure 5, the encryption time for each file size is plotted
for AES, Triple DES and the proposed DESD technique. Complex
operations generally require more time to process the data,
whereas the operations of both encryption stages in the
proposed crypto technique are simple. The time to encrypt files
of different sizes is therefore reduced compared to the other
techniques. The graph clearly shows that the proposed DESD
technique has a lower encryption time than AES and Triple DES.



28 https://sites.google.com/site/ijcsis/ 

ISSN 1947-5500 


















International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 



Figure 6: File size (MB) vs. Decryption Time (sec) 

Figure 6 shows the comparison plot of file size (MB) against
decryption time (sec) for the proposed and existing techniques,
using the same file sizes as for encryption. The decryption
time is the time required to decrypt the encrypted data. As
mentioned above, the proposed technique is simple, i.e. it
requires only simple mathematical operations to encrypt data
files, and decryption is likewise a simple process. Moreover, a
single decryption is enough to decrypt data that has been
encrypted twice, so decryption requires very little time. From
Figure 6 it can be observed that the proposed decryption
requires much less time than the AES and Triple DES techniques.

The two comparisons above indicate that the proposed DESD
technique is efficient and fast because of its simple
operations, and that it is more secure because it never
encloses the decryption key with the encrypted data. In
addition, the data user has to know the inverse INF and must
communicate with the data owner to get the decryption key.
Hence the data is protected against intruders, unauthorized
users and the self-interest of the CSP.

7. Conclusion 


This paper proposed a novel Double Encryption with Single
Decryption (DESD) crypto technique for secure data storage in
the cloud. Data partitioning is done to make storage easy and
effective, and it also provides flexible data access with lower
storage cost. Double encryption is then performed on each data
partition: first an encryption with prime numbers and their
complements, and then an INF encryption. Using the proposed
decryption algorithm, the obtained data can be decrypted by the
user. The major benefit of the proposed technique is that the
encryption is done by the data owner and only the encrypted
data is stored in the cloud storage with the help of the CSP.
The authorized users know the inverse INF, which is another
important factor for decryption. Thus intruders and third
parties are not able to retrieve and misuse the cloud data
without knowledge of the inverse INF and the decryption key. In
the experimental section the proposed technique is compared
with the AES and Triple DES techniques. The performance
analysis uses encryption time and decryption time against file
size as parameters. From Figures 5 and 6 it is clearly observed
that the proposed crypto technique is efficient and faster, in
terms of reduced encryption and decryption time, than the other
techniques. In future, the proposed DESD crypto technique will
be used to encrypt video files.

Abstract- This paper presents a new genetic approach, the
Volvox Reproduction (VR) algorithm, which is based on a natural
concept. A hybrid technique (HVR) is then designed to generate
a wide colony of positive integers, which are used for the
encryption (decryption) of English texts of different sizes.
The method is based on shifting the positions of letters. The
HVR process was built using Matlab and attained 100% success.

Keywords- Volvox Reproduction Algorithm (VR); Hybrid VR
Algorithm (HVR); Genetic Algorithm; Volvox Colony; Encryption;
Decryption.

I.INTRODUCTION 

Alongside the technological development of our time, mobile
phones have evolved and now offer many services, the most
common and most used of which is SMS text messaging. SMS has
become a means of transmitting important information and
personal data (bank account data, secret army information, and
user information for some programs, such as login codes and
passwords). Because this information may be stolen or hacked,
the purpose of this research is to build an application for
mobile devices supporting the Android platform. The application
uses the SMS service provided by mobile devices to send user
data securely, such as access information for bank accounts and
payment cards or any other confidential and important
information, by converting it into a set of symbols, letters
and numbers that cannot be understood. To this end, a new
encryption algorithm is suggested in which the secret key
between the two parties is based on the phone numbers of the
sender and the receiver.

II. BACKGROUND 
Encryption is defined as the process of converting clear
information into unintelligible information to prevent
unauthorized persons from accessing or understanding it.
Encryption, therefore, involves converting plain text into
encrypted text. The confidentiality of information is
maintained by means of methods or algorithms that convert the
information into a mixture of symbols, numbers and
unintelligible characters and then transfer it over a
transmission medium to the receiver, who restores it to its
understandable form [5], [6], [8].

The components of the encryption system are: plain text, cipher
text and key. Encryption has become very important, especially
in the twentieth century. It was used in military, diplomatic
and security communications and correspondence, and since the
mid-seventies it has reached other areas and applications,
including:

1. In industry and commerce.

2. In video broadcasting.

3. In banks.

4. In computer networks and personal computers.

5. In the protection of telecommunications from interception,
eavesdropping and disclosure of the secrets of others.

A strong encryption system must achieve the following
objectives and characteristics [7]:


*. Privacy or Confidentiality.
*. Data Integrity.
*. Entity Authentication.
*. Message Authentication.
*. Certification.
*. Time Stamping.
*. Witness.
*. Receipt.
*. Confirmation.
*. Ownership.
*. Anonymity.
*. Revocation.
*. Signature.
*. Authorization.
*. Validation.
*. Access Control.

Modern key management systems adopt two fundamental structures,
symmetric encryption (SE) and asymmetric encryption (AE).

SE is a method that encrypts the data sent between two entities
using a single key. To this end, the entities first agree on a
key, which is later used to both encrypt and decrypt the data.
Asymmetric encryption is a method that encrypts the data sent
between two entities using multiple keys, as follows: a first
key, which is used to encrypt the data, and a second key, which
is used to decrypt the data.

The genetic algorithm (GA) is defined as an artificial
intelligence technique that can be used to solve and optimize
difficult problems.

GAs are considered numerical optimization algorithms derived
from the concepts of natural selection and genetics.

The GA can be applied in a wide range of fields.

The genetic algorithm has been successfully applied to find
acceptable (near-ideal) solutions to problems in science,
including the medical and engineering sciences, where it has
greatly reduced the time and effort required of system and
software designers [10].

Genetic algorithms produce new generations whose
characteristics are identical to or better than those of the
original; both sexual and asexual reproduction are used to form
new individuals.


III. LITERATURE REVIEW 
There are many studies in this area, all of which review
techniques that were implemented and analyzed; most of them are
concerned with algorithms for encrypting messages on the cell
phone [1], [2], [3], [4].

IV. THE PROPOSED METHOD 
In general, this work adopts the idea of the natural
reproduction of Volvox algae to build a new hybrid algorithm
(HVR), which is applied to data security. The encryption and
decryption of text messages start from the phone numbers of the
sender and the receiver. The secret key is generated from these
two numbers after a series of mathematical operations, so that
if a phone number different from that of the receiving party is
used, the message resulting from the decryption process will be
completely different from the original message.

V. THE NATURAL 
REPRESENTATION 
Algae are among the oldest and most important living organisms;
fossils of them dating back millions of years have been found.
Algae still amaze humans with their great benefits and with new
discoveries about their forms, functions, types and
characteristics, their way of life and importance, and how to
identify them; each of these aspects is a fertile field for
research and reflection. Algae are of great environmental
importance because they are the primary producers and the first
link in aquatic food chains. They have two methods of
reproduction: asexual reproduction and sexual reproduction.

In this work the concept of Volvox algae was adopted because
these algae are either monoecious or dioecious, in addition to
the characteristics listed in the following paragraphs.

Volvox is one of the species of single-cell algae, the most
advanced in the series of species that form spherical colonies,
"see Figure 1".

Each adult Volvox colony consists of a huge number of
flagellated cells, ranging from 500 to 60,000 cells [9].

The classification of Volvox according to the International
Classification of Plants is:

* Kingdom: Algae
* Division: Chlorophycophyta
* Class: Chlorophyceae
* Order: Volvocales
* Family: Volvocaceae
* Genus: Volvox


Figure 1, Volvox colony 

Sexual reproduction starts when the contents of the antheridia
(the male sexual organs, which contain the male gametes) divide
into a large number of male gametes. The male gametes are
released from the antheridia and swim in the water until they
reach the oogonium (the female sexual organ), where one male
gamete fuses with the egg to form the fertilized egg (zygote).
The zygote then secretes a thick membrane around itself to
resist unfavourable conditions. When conditions improve, it
begins to divide by meiosis, followed by several simple
divisions that form a new colony, "see Figure 2".



Figure 2 , The Volvox Reproduction 


VI. The Mathematical representation 

The proposed method uses the VR algorithm to generate a wide
colony of integers, which is used as the public key to encrypt
(or decrypt) a given text; see Figure 3 and Figure 4.

First: The VR Algorithm

This paragraph describes a new GA, the Volvox Reproduction (VR)
algorithm, which in general has the following terminology:

*. The antherozoids (A) represent the male reproductive segment
(first integer number).

*. The oogonium (B) represents the female reproductive segment
(second integer number), which grows in size and turns into the
female gamete called the egg.

*. Mature (A) combines with mature (B) through the
recombination operator (crossover) to give the zygote (Z) in
the first stage, the nuclear fusion phase. In the second stage
the zygote turns into the fertilized egg (FE), which is
equivalent to the first arithmetic operations.

Finally, in the cleavage stage (multiple nuclear divisions),
the fertilized egg (FE) is divided several times in succession
(a repeated multiplication process) in order to form the new
colony (NC), which is the public key.

Second: The HVR Algorithm

The Hybrid Volvox Reproduction (HVR) technique is explained by
steps that use the VR algorithm to generate a wide colony of
integers, which is used as the public key to encrypt or decrypt
the given text.

The following steps describe how the VR algorithm generates the
public key (NC) for the encryption (decryption) of any given
text of any size.

The algorithm starts with the following inputs: a, b, d, T,
q = 3.

Step(1): Find the size of the text.

Input: T

Output: L, NO, satisfying:

$Lq_i = 2^q (2i + 1) + 1$,

$Rq_i = 2^q (2i + q)$,

if $L < 40$ then set $NO = 2q$,

else if $L \in [Lq_i, Rq_i]$ then set $NO = 2i + 1$,

$\forall i,\ i = 1, 2, \ldots, N$

Step(2): Growth of A and B as follows:

Input: a, b, d
Output: SN, RN
$A = (a)^g$,

$B = (b)^g$, where g is a positive integer greater than or
equal to 12, and cut d digits from A (send number) and B
(receive number),

i = 0;

Do { "The Reproduction process"
i++;

Step(3): Apply the shifting condition.
Check the following condition on the digits of A and B,
$\forall i,\ i = 1, \ldots, d$:
if ($A(i) = 0 \to$ set $A(i) = 9$) or ($A(i) > 10 \to$ set $A(i) = A(i) - 9$)
if ($B(i) = 0 \to$ set $B(i) = 9$) or ($B(i) > 10 \to$ set $B(i) = B(i) - 9$),
then reverse B. Select the antheridium, SN = A, and the
oogonium, RN = B, "see Table 1".

Step(4): Initial fertilization step.

Input: SN, RN
Output: Z

$\forall i,\ i = 1, \ldots, d$: $Z(i) = SN(i) \cdot RN(i)$

Step(5): Second fertilization step to get the zygote, which is
denoted by the public key (PK).

Input: Z, NO
Output: NC

$Zigote(i) = (Z(i))^{NO} \cdot SN(i)$, $\forall i,\ i = 1, 2, \ldots, N$

$NC = (Zigote(1), Zigote(2), \ldots, Zigote(N))$
} While ($L_i \in [Lq_i, Rq_i]$)

Step(6): Cipher operation depending on the ASCII code.

Input: T, NC.

Output: NT.

The cipher operation depends on a mathematical condition on NC,
as follows:

either: $NT(i) = ASCII(T(i)) - NC(i)$
or: $NT(i) = ASCII(T(i)) + NC(i)$

Note that the algorithm ends with the new text (NT). If T is
plain text then the process is encryption; otherwise it is
decryption, "see Figure (4)".
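
The key generation and the ASCII shift can be sketched in Python as follows. This is a loose illustration under several assumptions that the text does not settle: the d digits are taken from the front of a^g and b^g, the colony is reused cyclically when the text is longer than the colony, addition is used for encryption and subtraction for decryption, and the two phone numbers are made up. All function names are hypothetical.

```python
def leading_digits(n: int, d: int) -> list:
    """Cut d digits from n (assumed here to be the leading digits)."""
    return [int(ch) for ch in str(n)[:d]]


def shift_digits(digits: list) -> list:
    """Shifting condition: 0 becomes 9; values above 10 are reduced by 9."""
    return [9 if x == 0 else (x - 9 if x > 10 else x) for x in digits]


def generate_colony(a: int, b: int, d: int, no: int, g: int = 12) -> list:
    """Build the colony NC from the two numbers a and b."""
    sn = shift_digits(leading_digits(a ** g, d))        # antheridium SN
    rn = shift_digits(leading_digits(b ** g, d))[::-1]  # oogonium RN, reversed
    z = [s * r for s, r in zip(sn, rn)]                 # initial fertilization
    return [zi ** no * s for zi, s in zip(z, sn)]       # second fertilization


def apply_colony(codes: list, colony: list, sign: int) -> list:
    """Shift each code by the matching colony value (+1 encrypts, -1 decrypts)."""
    return [c + sign * colony[i % len(colony)] for i, c in enumerate(codes)]


# Hypothetical sender and receiver numbers, for illustration only.
colony = generate_colony(a=7701234567, b=7509876543, d=8, no=6)
codes = [ord(ch) for ch in "HELLO"]
encrypted = apply_colony(codes, colony, +1)
decoded = apply_colony(encrypted, colony, -1)
```

Since every digit is at most 9, the "greater than 10" branch of the shifting condition never fires in this sketch; it is kept only to mirror the step as printed.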



Figure 3. The VR Algorithm 



VII. RESULTS 

The proposed method was applied to texts of different sizes.
Table 1 contains the ranges of text length (Li) and the
positive integer (NO) used when applying the HVR algorithm to
an input text (T).


Table 1, Range of Li and NO.

Li          NO      Li          NO
001-040     6       121-136     30
041-056     10      137-152     34
057-072     14      153-168     38
073-088     18      169-184     42
089-104     22      ...         ...
105-120     26      ...         ...

The time of execution was measured in (mm). The program was
written in MATLAB to compute the scalars that measure the
success of the HVR method. Time is the first scalar, and it is
used to describe the results of Table (2).
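
The ranges in Table 1 can be reproduced from the bounds of Step (1) with q = 3, as the short Python sketch below illustrates. The function names are hypothetical. Note one assumption in particular: the NO values listed in Table 1 (10, 14, 18, ...) match 2(2i + 1), so that expression is used here in order to agree with the table.

```python
Q = 3  # value of q given as an input of the algorithm


def range_bounds(i: int, q: int = Q) -> tuple:
    """Left and right bounds of the i-th length range."""
    return 2 ** q * (2 * i + 1) + 1, 2 ** q * (2 * i + q)


def lookup_no(length: int, q: int = Q) -> int:
    """Return NO for a text of the given length, following Table 1 (q = 3)."""
    if length <= 40:
        return 2 * q
    i = 2  # the first range above 40 is i = 2, i.e. 41-56
    while True:
        lo, hi = range_bounds(i, q)
        if lo > length:
            raise ValueError("length falls outside every range for this q")
        if length <= hi:
            return 2 * (2 * i + 1)  # assumption inferred from Table 1
        i += 1


table_rows = [(range_bounds(i), 2 * (2 * i + 1)) for i in range(2, 11)]
```

For q = 3 consecutive ranges are contiguous (each right bound is one less than the next left bound), so every length above 40 falls into exactly one range.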

Figures 5 and 6 show these results, where the symbols are as
follows:

TE : Time of encryption of the text
TD : Time of decryption of the text
TT : Total time of the text
TTE: Total time of execution of the encryption program
TTD: Total time of execution of the decryption program
TTS: Total time of execution of the encryption and decryption
programs




Figure 5, Time of Encryption and Decryption 



Figure 6, Total Time of Encryption and 
Decryption 


Table 2, Time of Encryption, Decryption and Total

        Time of partial execution    Time for total execution
Li      TE      TD      TS           TTE        TTD       TTS
1       46      14      60           21392      46610     68002
2       52      79      131          44524      38007     82531
3       44      16      60           46434      51125     97559
20      28      14      42           100165     31288     131453
40      15      15      30           149995     22358     172353
41      16      15      31           196668     18582     215250
49      17      23      40           177521     23810     201331
57      16      27      43           475713     31864     507577
65      17      15      32           307414     29136     336550
72      19      19      38           1378448    30262     1408710
73      16      17      33           236960     35417     272377
86      17      16      33           299536     23646     323182
88      17      16      33           305910     21281     327191
89      15      17      32           331250     34811     366061
97      16      17      33           349200     22990     372190
104     14      16      30           369154     22412     391566
105     20      19      39           391805     18491     410296
113     20      16      36           976871     43977     1020848
120     24      23      47           450947     25980     476927
121     18      25      43           595526     26526     622052
129     20      17      37           759813     23930     783743
136     26      17      43           515506     26171     541677
137     17      19      36           623842     29859     653701
145     17      17      34           595119     26299     621418
152     21      18      39           738931     23263     762194
155     18      19      37           1381600    20194     1401794
161     18      26      44           778209     20636     798845
169     23      18      41           577098     22735     599833
177     17      17      34           49737      27992     77729
184     24      19      43           663831     19800     683631


Table 3 summarizes Table 2 and is plotted in Figure 7, where Li
denotes the length of the text.

Table 3, Time of Encryption, Decryption and Total

Li          TE       TD       TS
001-040     37.00    36.00    73.00
041-056     16.50    19.00    35.5
057-072     17.33    20.33    37.66
073-088     16.67    16.33    33.00
089-104     15.00    16.67    31.67
105-120     21.33    19.33    40.66
121-136     22.00    19.67    41.67
137-152     18.33    18.00    36.33
153-168     18.00    22.50    40.50
169-184     21.33    18.00    39.33



Figure 7, Time of Encryption and Decryption 


Table 4 shows the efficiency scalars of encryption (EEE) and
decryption (EED) of the proposed method; in terms of time, the
two are approximately equivalent.


Table 4, Efficiency of HVR algorithm

            Time                        Efficiency
Li          TE       TD       TT        EEE      EED
001-040     37.00    36.00    73.00     4E-04    3E-04
041-056     16.50    19.00    35.5      2E-04    1E-04
057-072     17.33    20.33    37.66     2E-04    2E-04
073-088     16.67    16.33    33.00     2E-04    1E-04
089-104     15.00    16.67    31.67     2E-04    1E-04
105-120     21.33    19.33    40.66     1E-04    1E-04
121-136     22.00    19.67    41.67     1E-04    1E-04
137-152     18.33    18.00    36.33     1E-04    1E-04
153-168     18.00    22.50    40.50     1E-04    9E-05
169-184     21.33    18.00    39.33     1E-04    1E-04


From Table 5, note that the fitting ratio (equation (1))
between ciphering and deciphering is approximately 100%. The
ratio is calculated between the input text
($T: t_i \in T, \forall i$) and the resulting text
($\hat{T}: \hat{t}_i \in \hat{T}, \forall i$) obtained by
applying ciphering and deciphering to that input text.


$fitt = 100 \cdot \left(1 - \frac{\sum_{i=1}^{L} (t_i - \hat{t}_i)^2}{\sum_{i=1}^{L} (t_i - mt)^2}\right)$    (1)


where: $mt = mean(T)$
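
A short Python sketch of the fitting ratio is given below for illustration. It assumes the usual normalized form, 100 * (1 - residual sum of squares / total sum of squares), applied to the character codes of the two texts; the function name and the sample strings are hypothetical.

```python
def fitting_ratio(original: list, restored: list) -> float:
    """Fitting ratio in percent between original and restored code sequences."""
    if len(original) != len(restored) or not original:
        raise ValueError("both sequences must be non-empty and of equal length")
    mean = sum(original) / len(original)                     # mt = mean(T)
    total = sum((t - mean) ** 2 for t in original)           # denominator
    if total == 0:
        raise ValueError("ratio is undefined when all codes are identical")
    residual = sum((t - r) ** 2 for t, r in zip(original, restored))
    return 100.0 * (1.0 - residual / total)


codes = [ord(ch) for ch in "HELLO"]
perfect = fitting_ratio(codes, codes)          # identical texts
partial = fitting_ratio([1, 2, 3], [1, 2, 4])  # one code off by one
```

A text that is restored exactly gives a residual of zero and therefore a ratio of 100%, which is the case the table reports for most length ranges.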


Table 5, Fitting Ratio Scalar

Li          Fitting ratio    Li          Fitting ratio
1-40        100.000%         121-136     99.999%
41-56       99.900%          137-152     99.998%
57-72       99.999%          153-168     100.000%
73-88       99.000%          169-184     99.989%
89-104      100.000%         ...         ...
105-120     99.998%          ...         ...

VIII. CONCLUSIONS

*. A new genetic algorithm has been proposed for text encryption
and decryption.

*. The algorithm has been applied to the encryption and decryption
of texts of different sizes.

*. Several measures have been introduced to assess the efficiency
of the proposed algorithm: Time, Efficiency and Fitting Ratio.

*. These measures were applied to the encryption and decryption of
the texts.

*. The algorithm proved successful in encryption and decryption,
with a fitting ratio of up to 100%.

*. The encryption method was hybridized using the proposed genetic
algorithm.

*. Mathematical concepts were used in the proposed method.

*. The proposed algorithm can be applied on mobile devices rather
than on PCs.

Integration on Radial Distribution Feeder


Mohsin A. Mari 

Department of Electrical Engineering 
Mehran UET SZAB Campus Khairpur Mir's 
E.mail: mohsinali@muetkhp.edu.pk 
M. Zubair Bhayo 
Department of Electrical Technology 
Benazir Bhutto Shaheed University of Technology and Skill 
Development Khairpur Mir's 
E.mail: muhammadzubairbhayo@gmail.com 

Abstract- The world's population and its demand for electrical
power are increasing day by day, while the available fossil fuel
resources are being depleted. It is therefore wise to harness
natural renewable energy resources. Among these, solar energy is
a valuable source, and in Pakistan it is abundant and can be
easily extracted.

In this research work, the impacts of a solar generation system
integrated with an 11 kV radial distribution feeder are analyzed.

The PV system is integrated with the feeder in three different
ways using SINCAL software, and its impacts on power loss, voltage
profile and short circuit level are analyzed. When the PV system
is integrated on the HT side, the result is a negligible increase
in voltage, no change in LT losses, a negligible decrease in HT
losses and no change in short circuit level. When the PV system is
connected to the LT bus-bar of each transformer, there is a
significant increase in voltage, a small decrease in LT losses, a
significant decrease in HT losses and a small increase in short
circuit level. When the PV system is connected at each load, there
is a significant increase in voltage, a large decrease in LT
losses, a significant decrease in HT losses and a small increase
in short circuit level.

Keywords: Solar PV system; voltage; short circuit level; power 
losses; distribution feeder. 

1. INTRODUCTION 

Like other developing countries, Pakistan is facing a critical
energy shortage due to rapid population growth and insufficient
available resources [11-17]. The situation is made worse by the
technical and non-technical losses occurring in the existing
system [6]. At present the main resources for power generation in
Pakistan are fossil fuels, including oil and gas, whose supply is
not guaranteed to meet increased future demand. The other major
cause of energy wastage is the large distance between generation
and utilization. The generation plants are far from the loads, so
power is carried over long transmission lines at high voltage.
Power reaches consumers through primary transmission at 500 kV or
220 kV, secondary transmission at 132 kV or 66 kV, primary
distribution at 11 kV, secondary distribution at 380 V or 230 V,
and finally the service mains. This whole network introduces
various problems and losses into the electrical power system [10].

If the distance between generation and distribution is reduced,
the power losses will be greatly reduced and power can be saved
[5]. This will not only narrow the gap between generation and
demand but also help meet future requirements. The best
alternative is to install and integrate distributed generation
locally, near the end consumers [9]. One of the simplest and
cheapest sources of distributed generation is solar power. Small
solar PV arrays connected in a series and parallel combination can
generate the required power at the required DC voltage, which is
then converted into AC voltage using inverters [3]. Such a system
can feed the required electrical energy to a home or town locally
instead of purchasing costly energy from WAPDA. The second great
benefit is that the consumer's demand on the national grid is
reduced. Since power no longer travels over long distances, the
power losses are greatly reduced [8].

Besides the many advantages of renewable solar energy, such
systems have both positive and negative impacts when integrated
with an existing working system [7]. These impacts depend on the
size of the distributed generation, the integration technique, the
location of integration and the design of the existing system. The
main parameters affected are the system voltage, the power losses
and the short circuit level of the feeder. To analyze these
impacts, an existing 11 kV HESCO feeder named Sachal feeder is
considered and simulated using the SINCAL software. First, the
normal feeder is simulated and its load flow and short circuit
calculations are obtained without any solar PV integration. Then
three different cases are simulated. In the first case, three
solar generators, each rated at 1 MW, are installed at three
equally spaced locations. In the second case, one solar generator
rated at half of the total load capacity of the transformer is
integrated with the LT bus bar of the transformer. In the third
case, one solar generator rated at half of the load demand is
connected at each load. Finally, the results are compared in terms
of voltage, power losses and short circuit current level.

Section 2 discusses solar technology as distributed generation.
The selected feeder and its details are given in Section 3.
Simulation results are discussed in Section 4, where the impacts
of solar PV integration are given in tables and graphs in terms of
power loss and short circuit level. Finally, the conclusion of the
paper is given in Section 5.

2. SOLAR PHOTOVOLTAIC GENERATION AND INTEGRATION

In 1839 a French physicist named Becquerel discovered the
photovoltaic effect. Until 1954 [7] it remained confined to
laboratories. Bell Laboratories then produced the first practical
silicon cell, which was quickly improved and used in the United
States space program. Solar photovoltaic cells convert light
energy directly into electrical energy. The output voltage of a
single cell is very small, so to increase it the cells are
fabricated in the form of an array. Each array is called a solar
panel and can generate a voltage of 12 V or 24 V DC [4]. To
increase the voltage further, these arrays are connected in series
with each other, because voltages add in series. To increase the
power rating, the arrays are further connected in parallel. Each
small array can provide a power of 35 W, 50 W, 100 W or 150 W.
Since this energy relies entirely on the presence of the sun, the
output of the cells naturally varies. To compensate for this
variation, the DC voltage can be regulated by DC-DC converters. A
DC load can be fed directly from the output of the DC-DC
converters, or batteries can be charged during the day and used to
run the DC load at night. To feed AC loads, the generated DC
voltage is converted into AC voltage by DC-AC converters, also
called inverters. Local AC loads are fed directly from the output
of these inverters, and to integrate this AC voltage into an
existing AC system a transformer is used to step the voltage up to
the voltage available in the system [2].
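The series and parallel rule described above can be sketched as a small sizing calculation. This is an idealized illustration, not the design procedure of the paper: the 24 V and 150 W panel ratings come from the text, while the 240 V and 10 kW targets and the function name `size_array` are assumptions, and losses, temperature effects and operating-point variation are ignored.

```python
import math


def size_array(target_voltage_v, target_power_w, panel_voltage_v, panel_power_w):
    """Return (panels per series string, number of parallel strings).

    Series connection adds panel voltages; parallel strings add power.
    Idealized: no losses, every panel delivers its rated values.
    """
    n_series = math.ceil(target_voltage_v / panel_voltage_v)
    string_power_w = n_series * panel_power_w
    n_parallel = math.ceil(target_power_w / string_power_w)
    return n_series, n_parallel


# Hypothetical target: a 240 V DC bus delivering 10 kW from 24 V, 150 W panels.
series, parallel = size_array(240, 10_000, 24, 150)
print(series, parallel, series * parallel)  # panels per string, strings, total panels
```

With these assumed targets the result is 10 panels per string and 7 strings, that is 70 panels with a combined rating of 10.5 kW.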

Solar PV generators can be designed in many sizes and used for
commercial, residential and irrigation loads. They provide real
power to the loads. They are environmentally friendly and easy to
install. They provide backup support and increase reliability when
integrated with an existing system. They can help the main grid
during peak hours and supply consumers when the load on a
particular feeder is beyond its capacity [1].

Large-scale solar power generation and integration at either low
or high voltage is an emerging trend nowadays. At present, the
power systems of Pakistan are overloaded and insufficient for the
demand, so solar generation is a valuable alternative for meeting
the increased load. Fig. 1 shows a solar power system in which
solar panels are arranged in a series and parallel combination to
deliver the required voltage and power. A computer-controlled
system monitors the generation. The inverters, shown in a box,
convert the DC voltage to AC voltage, which is then fed to a
transformer to step it up. Finally, the transformer output is
connected to the feeder or national grid through a three-phase
distribution system.



Fig. 1. Solar power plant. 


Several countries offer financial incentives to residential
consumers to install solar power systems to meet their local load
demands. When a solar PV generation system is integrated with a
distribution system, the power flow and impedance matrices of the
system change depending on the size and configuration of the PV
system. It is therefore important to analyze the impacts of this
distributed generation on the radial feeder of HESCO.

3. SYSTEM DESCRIPTION 

In this research work, the 11 kV Sachal feeder is selected to
observe the impacts of a solar PV system in terms of voltage,
power loss and short circuit current level. Around 3000
residential and commercial consumers are supplied through 41 Pole
Mounted Transformers (PMT) with a total capacity of 7600 kVA. Dog
and Rabbit conductors are used for the high tension (H.T) network
of 7.42 km.

All required feeder data was collected from HESCO, and the model
was then designed and simulated using PSS SINCAL software.
Separate models were developed for all L.T circuits and linked
with the H.T circuit for simulation purposes. Four different cases
are analyzed using different simulation circuits. The cases are
described in Table 1.


TABLE 1. CASES FOR SYSTEM SIMULATION

Case #    Description
I         Existing network without any solar system
II        Three solar PV systems connected with HT feeder
III       41 separate PV systems connected with transformer LT side
IV        Each load point connected with a separate PV system


4. RESULTS AND DISCUSSIONS

Simulation results for the four cases are analyzed to observe the
positive or negative impacts of solar generation on power losses,
short circuit level and voltage on the radial distribution feeder.

Table 2 compares the voltages of some H.T buses in the system for
the four cases. Bus voltages improve with PV integration. This
improvement is due to the reduction in current flow, as loads are
supplied by a nearby PV source. The largest voltage improvement is
observed for Case IV, where PV generation is connected nearest to
the load.


TABLE 2. H.T BUS VOLTAGE COMPARISON

       Voltage (kV)
Bus    Without PV    PV with H.T    PV at transformer secondary    PV with Load
1      10.993        10.993         10.995                         10.996
2      10.993        10.994         10.995                         10.995
3      10.846        10.845         10.893                         10.901
4      10.968        10.970         10.977                         10.980
5      10.963        10.968         10.976                         10.977
6      10.961        10.966         10.972                         10.976
7      10.960        10.964         10.972                         10.975
8      10.959        10.963         10.971                         10.974
9      10.954        10.959         10.967                         10.972
10     10.953        10.958         10.966                         10.971
11     10.952        10.958         10.968                         10.970
12     10.951        10.955         10.965                         10.970
13     10.948        10.954         10.966                         10.968
14     10.942        10.951         10.962                         10.964
15     10.939        10.946         10.960                         10.962


Short circuit level in MVA is the product of fault current and
system rated voltage. It is an important consideration in system
design, as circuit breakers are rated according to the calculated
short circuit level. Short circuit level depends on the system
impedance and configuration. When a solar PV generation system is
connected to an existing system, the system impedance changes,
which can increase the short circuit level. Such an increase may
make it necessary to raise circuit breaker capacities or to
incorporate current limiting resistors. Table 3 compares the short
circuit level of some of the H.T buses for the four cases. The
short circuit level has increased for all buses, but the increase
is not significant and the same circuit breakers are therefore
sufficient.
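To put the size of these increases in perspective, the short sketch below converts a short circuit level into an equivalent fault current and computes the percentage change between the base case and Case IV. It assumes a balanced three-phase system, so that S = sqrt(3) * V * I with V the 11 kV line voltage; the function names are illustrative, and the MVA figures are the bus 1 and bus 15 entries of Table 3.

```python
import math


def fault_current_ka(short_circuit_mva, line_voltage_kv):
    """Fault current in kA, assuming a balanced three-phase system."""
    return short_circuit_mva / (math.sqrt(3) * line_voltage_kv)


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


# (without PV, PV with load) short circuit levels in MVA, from Table 3.
buses = {1: (992.133, 992.507), 15: (165.860, 166.589)}

for bus, (base_mva, pv_mva) in buses.items():
    print(
        f"bus {bus}: {fault_current_ka(base_mva, 11.0):.2f} kA, "
        f"change {percent_change(base_mva, pv_mva):.3f} %"
    )
```

Under these assumptions both changes stay below half a percent, which is consistent with the statement that the existing circuit breakers remain sufficient.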


TABLE 3. H.T. BUS SHORT CIRCUIT LEVEL COMPARISON

       Short circuit level (MVA)
Bus    Without PV    PV with H.T    PV at transformer secondary    PV with Load
1      992.133       992.139        992.426                        992.507
2      864.708       864.728        864.972                        865.047
3      622.170       622.235        622.677                        622.846
4      558.005       558.107        558.539                        558.719
5      529.146       529.154        529.661                        529.841
6      492.426       492.443        492.915                        493.094
7      345.357       345.378        345.733                        345.895
8      373.434       373.512        373.848                        373.905
9      319.995       320.016        320.350                        320.505
10     275.644       275.696        275.962                        276.099
11     264.743       264.859        265.050                        265.182
12     258.943       259.012        259.242                        259.370
13     250.076       250.125        250.365                        250.489
14     249.262       249.413        249.550                        249.675
15     165.860       165.984        166.445                        166.589


PV integration into the power system changes the system losses
because the current flows change. Table 4 compares branch power
losses for some of the H.T branches, and Table 5 compares power
losses for the L.T circuits of all 41 transformers. Power losses
in most branches of the H.T circuit are slightly reduced.
Similarly, the L.T circuits show a reduction in power losses. A
loss reduction is observed for all branches when PV is integrated
nearest to the loads.
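The reductions listed in Table 4 can be expressed as percentages of the base-case loss. The sketch below does this for two of the H.T branches; the values are copied from Table 4, while the function and variable names are illustrative assumptions.

```python
def loss_reduction_percent(base_kw, case_kw):
    """Loss reduction relative to the case without PV, in percent."""
    return 100.0 * (base_kw - case_kw) / base_kw


# Branch losses in kW from Table 4:
# (without PV, PV with H.T, PV at transformer secondary, PV with load)
ht_losses_kw = {
    "L4-8": (2.60, 2.53, 1.32, 1.22),
    "L1/1/4-1/1/4/15": (16.64, 16.58, 8.18, 7.66),
}
case_names = ("PV with H.T", "PV at transformer secondary", "PV with load")

for branch, (base, *cases) in ht_losses_kw.items():
    for name, loss in zip(case_names, cases):
        print(f"{branch:16s} {name:28s} {loss_reduction_percent(base, loss):6.2f} %")
```

For both branches the reduction is only a few percent with PV on the H.T side and roughly one half once PV is placed at the transformer secondary or at the loads.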


TABLE 4. H.T. BRANCH POWER LOSS COMPARISON

                   Power losses (kW)
Line               Without PV    PV with H.T    PV at transformer secondary    PV with Load
L1-4               0.45          0.41           0.23                           0.21
L4-8               2.60          2.53           1.32                           1.22
L8-9               0.39          0.34           0.20                           0.18
L9-10              0.22          0.22           0.11                           0.10
L10-11             0.29          0.27           0.14                           0.13
L11-12             0.22          0.22           0.11                           0.10
L12-15             0.01          0.01           0.00                           0.00
L1-1/1/3           5.01          5.00           2.61                           2.21
L1/1/3-1/1/4       1.28          1.25           0.66                           0.57
L1/1/4-1/1/4/15    16.64         16.58          8.18                           7.66
L1/1/4-1/1/5       0.04          0.04           0.03                           0.01
L1/1/5-1/1/6       0.05          0.04           0.03                           0.02
L1/1/6-1/1/9       0.20          0.20           0.13                           0.08
L12/5-12/7         0.05          0.04           0.02                           0.02

TABLE 5. L.T. CIRCUIT POWER LOSS COMPARISON

               Power losses (kW)
Transformer    Without PV    PV with H.T    PV at transformer secondary    PV with Load
T1             0.065         0.065          0.064                          0.031
T2             0.410         0.405          0.402                          0.162
T3             0.138         0.137          0.131                          0.06
T4             0.071         0.0705         0.068                          0.037
T5             0.378         0.377          0.376                          0.159
T6             0.196         0.195          0.193                          0.085
T7             0.421         0.420          0.415                          0.189
T8             0.033         0.033          0.032                          0.033
T9             0.091         0.090          0.089                          0.042
T10            0.022         0.020          0.018                          0.015
T11            0.066         0.065          0.060                          0.031
T12            15.931        15.81          14.494                         15.254
T13            0.233         0.230          0.228                          0.097
T14            0.237         0.234          0.232                          0.1
T15            0.458         0.454          0.447                          0.179
T16            0.401         0.398          0.393                          0.166
T17            0.014         0.013          0.010                          0.005
T18            0.135         0.132          0.130                          0.054
T19            0.165         0.163          0.161                          0.066
T20            0.027         0.026          0.021                          0.017
T21            0.032         0.030          0.029                          0.011
T22            0.035         0.033          0.031                          0.011
T23            0.044         0.042          0.041                          0.016
T24            0.027         0.026          0.021                          0.011
T25            0.044         0.040          0.035                          0.016
T26            0.044         0.041          0.039                          0.016
T27            0.049         0.044          0.040                          0.016
T28            0.049         0.042          0.031                          0.015
T29            0.013         0.013          0.012                          0.005
T30            0.311         0.310          0.308                          0.144
T31            0.336         0.331          0.325                          0.154
T32            0.061         0.060          0.059                          0.029
T33            0.144         0.141          0.138                          0.064
T34            0.028         0.025          0.021                          0.012
T35            0.036         0.031          0.026                          0.015
T36            0.143         0.138          0.131                          0.075
T37            0.068         0.064          0.059                          0.029
T38            0.134         0.127          0.117                          0.059
T39            1.124         1.119          1.095                          0.514
T40            0.026         0.021          0.018                          0.009
T41            0.199         0.194          0.181                          0.094


Fig. 2 shows a graphical comparison of the total power losses of
the Sachal feeder for the four cases. PV integration reduces the
power losses of the selected feeder considerably, and the largest
reduction is achieved when PV generation is connected near the
loads.


5. CONCLUSIONS 

Solar PV generation is increasing rapidly around the world. Large
PV power generation facilities are being installed by utilities.
Quaid e Azam solar park is one of the first large solar generation
facilities in Pakistan. Small PV power generation facilities are
also being installed around the country.

In this research work, the effects of solar PV generation on a
distribution network are analyzed by simulation in PSS-SINCAL
software. The 11 kV Sachal feeder is modelled and simulated after
collection of real-time data. Four different simulations are
performed to observe the effects of different PV locations:
• Without any PV 

• PV distributed on H.T network 

• PV connected with transformer bus on L.T 

• PV connected with loads representing 
individual PV generation by consumers 

Comparison of the results for the four simulation cases shows
that:

• H.T and L.T power losses are reduced with PV integration

• The highest improvement is observed when PV generation is
connected with the loads

• The lowest improvement is observed with PVs connected on the
H.T network

• Power and current flows change with PV integration

It is therefore concluded that PV integration improves system
performance, especially with small PV generation units installed
by consumers.

Abstract- Big Data is an evolution of Business Intelligence (BI).
Whereas traditional BI relies on data warehouses of limited size
(a few terabytes) and can hardly handle unstructured data or
real-time analysis, the era of Big Data opens a new technological
period offering advanced architectures and infrastructures that
allow sophisticated analyses taking into account the new data
integrated into the business ecosystem. In this article, we
present the results of an experimental study on the performance of
the best Big Data analytics framework (Spark) with the most
popular NoSQL databases, MongoDB and Hadoop. The objective of this
study is to determine the software combination that allows
sophisticated analysis in real time.

Keywords- big data analytics; NoSQL databases; Apache Spark;
Hadoop; MongoDB; performance.

I. Introduction 

For companies, the Big Data phenomenon covers two realities: on
the one hand the continuous explosion of data, and on the other
the capacity to process and analyze this great mass of data
profitably. With Big Data, organizations can now manage and
process massive data to extract value, decide and act in real
time.

NoSQL databases were developed to provide a set of new data
management features while overcoming some limitations of the
relational databases currently in use [1]. NoSQL databases are not
relational and do not require a model or structure for data
storage, which facilitates storing and searching data. In
addition, they allow horizontal scalability: administrators can
increase the number of server machines to reduce the overall
system load. The new nodes are integrated and operated
automatically by the system. Horizontal scalability reduces the
response time of queries at low cost.

Alongside the NoSQL databases (Hadoop, MongoDB, Cassandra, HBase,
Redis, Riak, etc.), a new profession has appeared: the data
scientist. Data science is the extraction of knowledge from data
sets [2, 3]. It employs techniques and theories derived from
several broader areas of mathematics, mainly statistics,
probabilistic models and machine learning. To develop algorithms
in a distributed environment, the analyst must therefore master
the tools of big data analytics (Mahout, MapReduce, Spark and
Storm) and learn the syntax of functional languages such as Scala,
Erlang or Clojure.

Big data analytics therefore favors a return to grace of
functional languages and of robust methods that are easily
distributed (MapReduce) over thousands of nodes: decision trees
[4, 5], random forests [6], k-means [7] and the Naive Bayes
classifier [8].

Any NoSQL database can fulfill the role of storing the collected
data. However, the need to analyze this data pushes us to choose
the database carefully, especially as the analytic part is
becoming more and more important in the field of Big Data. For
advanced, real-time analytics, the best framework available is
Apache Spark [9, 10]. According to the official documentation,
Spark uses the Hadoop HDFS file system.

In a previous study [11] based on a multicriteria analysis method,
the MongoDB system obtained the highest score. Today this result
is confirmed, as the system has become popular [12]. According to
a white paper [13] published by MongoDB, the combination of the
fastest analysis engine (Spark) with the fastest-growing database
(MongoDB) allows companies to easily perform reliable real-time
analysis. This led us to compare Spark's performance with the most
popular NoSQL databases, MongoDB and Hadoop. In this article, we
present and discuss the results of our experimental study, and
thereby determine the software combination that allows
sophisticated analyses in real time.

This paper is organized as follows: Section II presents big data
analytics on Hadoop and MongoDB. In Section III, we present the
results of an experimental study on the performance of the Spark
framework with MongoDB and Hadoop. Section IV provides a
conclusion.

II. Big data analytics 

In this part, we will introduce the data analysis technologies 
used on Hadoop and MongoDB. 

A. Big Data Analytics on Hadoop 

The first solution integrated with Hadoop for data analysis is the
MapReduce framework. MapReduce is not in itself a database
component. This distributed processing approach takes a list as
input and produces a list in return. It can be used in many
situations and is well suited to distributed processing needs and
decision-making processes.

MapReduce was defined in 2004 in an article written by Google. The
principle is simple: to distribute a computation, Google devised a
two-step operation. First, operations are assigned to each machine
(Map); then the results are grouped (Reduce). The needs that gave
birth to MapReduce at Google are twofold: how to handle gigantic
volumes of unstructured data (web pages to analyze to feed the
Google search engine, or the logs produced by its indexing
engines, for example), and how to derive results from them:
calculations, aggregates, summaries; in short, analysis.
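The two-step principle can be illustrated with a single-process word count. This is only a sketch of the programming model in plain Python, not Hadoop code: the `shuffle` step stands in for the grouping that a real framework performs between the two phases, and all function names are illustrative.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word of one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # In a cluster each document (or split) would be mapped on a different node.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))


print(word_count(["big data", "big analytics"]))
```

Because every map call depends only on its own input and every reduce call only on one key, both phases can be spread over many machines, which is what makes the model easy to distribute.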

The free reference implementation of MapReduce is called Hadoop, a
system developed in Java by a team led by Doug Cutting, for the
purposes of the Nutch distributed indexing engine, for Yahoo!.
Hadoop directly implements the Google paper on MapReduce and bases
its distributed storage on HDFS (Hadoop Distributed File System),
which implements the Google paper on GFS (Google File System). The
Hadoop MapReduce framework (YARN) was then implemented by several
NoSQL databases such as HBase and Cassandra.

Facebook then developed HQL (Hive Query Language) on Hive, a
language close to SQL for querying HDFS. Another language, Pig,
was developed by Yahoo; it is similar in syntax to Perl and
pursues the same goals as Hive. In addition, Cloudera, another
Hadoop distribution, integrates the Impala query engine, which is
favored by analysts and data scientists for analyzing data stored
in Hadoop via SQL or business intelligence tools. The Mahout
project provides implementations of algorithms for business
intelligence, for example machine-learning algorithms (k-means,
random forest).

B. Big Data Analytics on MongoDB 

MongoDB is an open-source, document-oriented database designed for
exceptionally high performance and developed in C++. Data is
stored and queried in BSON, a format similar to JSON. Its schemas
are dynamic and flexible, making data integration easier and
faster than with traditional databases. Unlike NoSQL databases
that offer only basic queries, MongoDB lets developers use native
queries and data mining capabilities to generate many classes of
analysis before having to adopt dedicated frameworks such as Spark
or MapReduce for more specialized tasks.

Several organizations, including McAfee, Salesforce, Buzzfeed,
Amadeus, KPMG and many others, rely on MongoDB's powerful query
language, aggregations and indexing to generate real-time
analytics directly on their operational data. MongoDB users have
access to a wide range of query, projection and update operators
that support real-time analytic queries on operational data:

• The MongoDB Aggregation Pipeline is similar in concept to the
SQL GROUP BY statement, enabling users to generate aggregations of
the values returned by a query (e.g., count, minimum, maximum,
average, intersections) that can be used to power analytics
dashboards and visualizations.

• Range queries return results based on values defined as
inequalities (e.g., greater than, less than or equal to, between).

• Search queries return results in relevance order and in faceted
groups, based on text arguments using Boolean operators (e.g.,
AND, OR, NOT), and through bucketing, grouping and counting of
query results.

• MongoDB provides native support for MapReduce, allowing complex
JavaScript processing. Multiple MapReduce jobs can run
simultaneously on the same server and on sharded collections.

• JOINs, graph queries, key-value queries, etc.
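As a rough illustration of the first item in the list above, the sketch below writes a GROUP BY-style aggregation pipeline as plain Python data and reproduces its grouping logic in memory. The field names `year` and `primary_type` and the sample records are hypothetical; the stages `$match`, `$group` and `$sort` are standard pipeline stages, and with the PyMongo driver such a list would typically be passed to `collection.aggregate(pipeline)`. The function `group_count` only emulates the result for this one pipeline and is not part of MongoDB.

```python
from collections import Counter

# Aggregation pipeline expressed as data: filter, group and count, then sort.
pipeline = [
    {"$match": {"year": 2017}},
    {"$group": {"_id": "$primary_type", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]


def group_count(records, year):
    """In-memory equivalent of the pipeline above (illustration only)."""
    counts = Counter(r["primary_type"] for r in records if r["year"] == year)
    return [{"_id": key, "count": n} for key, n in counts.most_common()]


# Hypothetical sample documents.
records = [
    {"year": 2017, "primary_type": "THEFT"},
    {"year": 2017, "primary_type": "BATTERY"},
    {"year": 2017, "primary_type": "THEFT"},
    {"year": 2016, "primary_type": "THEFT"},
]
print(group_count(records, 2017))
```

Expressing the query as a list of stages lets the database execute the filtering and grouping close to the data, rather than shipping every document to the client.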

C. Big Data Analytics on Hadoop 

Although the MapReduce framework is widely used by companies for
the analysis of Big Data, its response time is not satisfactory
and its programs run only as batch jobs. After a map or reduce
operation, the result must be written to disk. This data written
to disk allows mappers and reducers to communicate with each
other. Writing to disk also provides a degree of fault tolerance:
if a map or reduce operation fails, it is enough to read the data
back from disk and resume from that point. However, these writes
and reads are time consuming. In addition, a set of operations
composed exclusively of map and reduce is very limited and not
very expressive. In other words, it is difficult to express
complex operations using only these two operations.

Apache Spark is an alternative to Hadoop MapReduce for distributed
computing that aims to solve both of these problems. The
fundamental difference between Hadoop MapReduce and Spark is that
Spark keeps intermediate data in RAM rather than on disk. This has
several important consequences for processing speed as well as for
the overall architecture of Spark.

Spark offers a complete and unified framework (Figure 1) to meet
the needs of Big Data processing for datasets that vary both in
nature (text, graph, etc.) and in type of source (batch or
real-time flow). It allows applications to be written quickly in
Java, Scala or Python and includes a set of more than 80
high-level operators. It can also be used interactively to query
data from a shell. In addition to Map and Reduce operations, Spark
supports SQL queries and data streaming, and offers machine
learning and graph-oriented processing functions. Developers can
use these capabilities stand-alone or combine them into a complex
processing chain.

Figure 1: Apache Spark Ecosystem


Spark's programming model is similar to MapReduce, except that
Spark introduces a new abstraction called Resilient Distributed
Datasets (RDDs). Using RDDs, Spark can provide solutions for
several applications that previously required the integration of
multiple technologies, including SQL, streaming, machine learning
and graph processing.

A Dataset is a distributed collection of data. It can be viewed as
a conceptual evolution of RDDs (Resilient Distributed Datasets),
historically the first distributed data structure used by Spark. A
DataFrame is a Dataset organized into named columns, like a table
in a database. In the Scala programming interface, the DataFrame
type is simply an alias of the Dataset[Row] type.

Datasets support actions, which produce values, and
transformations, which produce new Datasets, as well as certain
functions that do not fit into either category.


We ran the tests on one node, three nodes and four nodes. The
machines used had the following configuration:

• 8GB RAM 

• Linux Fedora 26 

• 120 GB SSD 

• 6th generation i5 processor 


Table 1: Spark's performance with Hadoop and MongoDB

Nodes    File size (GB)    Action    Hadoop    MongoDB
1        1.55              first     96 ms     77 ms
                           count     10 s      2.0 min
3        3.11              first     90 ms     65 ms
                           count     19 s      3.4 min
4        4.66              first     0.1 s     57 s
                           count     29 s      5.3 min


// For example, to create a Dataset from the text file named LICENSE:
scala> val texteLicence = spark.read.textFile("LICENSE")

// An example of an action:
scala> texteLicence.count() // number of lines

/* A transformation builds a new Dataset containing only the lines of
   texteLicence that contain "Copyright"; take(2) then returns an
   array with its first 2 lines: */
scala> val lignesAvecCopyright = texteLicence.filter(line => line.contains("Copyright"))

scala> lignesAvecCopyright.take(2)


Figure 2: Spark Command lines Example 

Spark exposes RDDs through a functional programming 
API in Scala, Java, Python, and R, where users can simply 
pass local functions to run on the cluster. 

III. COMPARISON 

A. The Experiments Results 

We made the comparison on files of the same size and type (.CSV).
The test files are available at
"https://catalog.data.gov/dataset/crimes-2001-to-present-398a4".
We copied each file to the Hadoop file system. The same file was
then imported into MongoDB.


These results are illustrated in the following figure: 



Figure 3: Comparison of Spark's performance versus Hadoop 
and MongoDB 

B. Results Interpretation 

According to the results of this study, the execution time of
the first operation, which looks up the first record of the file,
is similar on Hadoop and MongoDB; sometimes Spark is faster
with MongoDB. For the count operation, however, which requires
loading the entire file into memory as an RDD,
Spark is much faster with Hadoop.

For the moment, Hadoop remains the best overall storage
solution, with more advanced administration, security
and monitoring tools. Oracle made this choice for its brand
new data discovery and analysis solution, Big Data Discovery.
The product installs on a Hadoop cluster (exclusively
Cloudera) and relies heavily on Spark for its processing.
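
As a rough check on this interpretation, the count timings reported in Table 1 can be converted to seconds and compared. The short Python sketch below is only an illustration added to the text, not part of the original experiment; the names `count_times` and `slowdown` are hypothetical.

```python
# Count timings from Table 1, converted to seconds: (Hadoop, MongoDB).
count_times = {
    1: (10.0, 2.0 * 60),   # 1 node, 1.55 GB
    3: (19.0, 3.4 * 60),   # 3 nodes, 3.11 GB
    4: (29.0, 5.3 * 60),   # 4 nodes, 4.66 GB
}

def slowdown(hadoop_s, mongodb_s):
    """How many times longer count takes with MongoDB than with Hadoop."""
    return mongodb_s / hadoop_s

ratios = {nodes: slowdown(h, m) for nodes, (h, m) in count_times.items()}
for nodes, ratio in sorted(ratios.items()):
    print(f"{nodes} node(s): count takes about {ratio:.1f}x longer with MongoDB")
```

On these figures, the count operation takes roughly 11 to 12 times longer with MongoDB than with Hadoop in all three configurations.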


IV. Conclusion 

In this article, we presented the results of an experimental
study on the performance of the best Big Analytics framework
(Spark) with two of the most popular NoSQL storage systems,
MongoDB and Hadoop. The aim of this study is to determine
the software combination that allows sophisticated analysis in 
real time. According to the results of this study, Spark is much 
faster with Hadoop.

Abstract- This survey reviews the latest literature on scheduling problems, which are closely related to load
balancing problems. The two terms are often used with the same meaning, and in fact it is not efficient to use one
without the other: scheduling determines the order in which tasks are executed on the available
devices, while load balancing seeks to balance these tasks between those devices. The motivation for this work comes
from the need to have, in one paper, a comprehensive view of these problems with an in-depth look at the
research trends involved. Several scheduling schemes under different constraints and optimization criteria are discussed.
We observe that the rapid technological development of machinery and equipment is accompanied by
intensive use of these devices. This calls for enhanced and improved scheduling algorithms, and the
trend is increasingly towards heuristic and approximate algorithms. Scheduling schemes range from
workshops to the Cloud, Fog and Edge computing segments of collaborative mobile computing, but we argue that they
have not yet been used effectively in its third segment: individual mobile networks. These networks can play the most
effective role in catastrophic situations, overcoming the problem of telephony/internet communication traffic at
the lowest cost or for free. We aim to motivate research on scheduling issues in this segment of collaborative mobile
computing, which becomes indispensable in emergencies such as hurricanes, floods, earthquakes and terrorist attacks,
when almost everything is damaged or inaccessible except our small mobile devices and ubiquitous resources.

1 INTRODUCTION 

Accelerated technological development constantly imposes real challenges in various fields, especially in
mobile networks, because of the rapid advances in computer architecture, mobile devices and wireless
communications. This has led to a transition from networks of large devices (desktops and laptops) to small mobile
devices connected by high-bandwidth wireless links. Advances in hardware and software technologies have
sparked increased interest in the use of these mobile devices within large-scale parallel and distributed
systems in fields such as databases, defense, real-time and commercial applications. The performance
of any system designed to operate a large number of devices depends on a task schedule that satisfies the
workload distribution across these devices.

Scheduling involves allocating resources over time to perform a collection of tasks [ , 6]. The need for
scheduling arose first in factories and industries before it became a de facto technique in multiprocessor
computers.

Due to the explosion of mobile and wireless technologies, and despite the fourth generation (and soon the 
fifth) deployment of wireless communication systems, several challenges still remain to be solved. These 
challenges include the spectrum crisis, high energy consumption, the ever-increasing demand for high data rates 
and the mobility required by new wireless applications. 

The latest technological advances in this area can be a solution to the intensive use of applications on mobile
devices. This situation has captivated users because it gives them freedom in terms of where and when
they pursue their jobs and interests, which in turn increases the demand for these devices. Unfortunately,
this demand is still growing faster than the technology. This reinforces the trend towards
improving scheduling and load balancing techniques.

According to Leung and Anderson in [ ], a scheduling process involves modeling a range of different
environments, which differ in the way information is released. They distinguish two paradigms: static
scheduling, in which all jobs and their related information are available at the start of the horizon, and dynamic
scheduling, in which jobs have different release or availability times. The authors pointed out that the decision maker
must optimize (usually minimize) a given objective function. Policies are generally classified
into: a) static policies, in which the decision maker has to specify at the outset all actions to
be taken during the evolution of the process; and b) dynamic policies, in which decisions are made at any time as
a function of all the information that has become available up to that point.




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


Among scheduling paradigms, a distinction is also made between i) offline deterministic scheduling, ii) stochastic
scheduling, and iii) online scheduling [ ].

In offline deterministic scheduling, all information about the problem is known a priori,
including the number of jobs, their release dates, due dates, weights, etc. The resulting problem is a
combinatorial optimization problem subject to some given constraints.

In stochastic scheduling, the number of jobs is fixed and known in advance. However, most or all of the
parameters describing a job, such as processing times, release dates and due dates, are considered to be random
variables drawn from known distributions.

In online scheduling, even less information is known beforehand; it is released gradually to the decision
maker, who knows nothing in advance about the release dates or processing times.

To summarize, offline deterministic scheduling deals with perfect information, stochastic scheduling with
input that is partially stochastic, and online deterministic scheduling with input that becomes known gradually
as it arrives in the system. A model that mixes the above models may also be the subject of
interesting studies. Indeed, Vredeveld [2] addressed the stochastic online scheduling (SOS) model. In this
model, jobs arrive in an online manner, and as soon as a job becomes known, the scheduler learns only the
probability distribution of its processing time and not the actual processing time. Both online scheduling and
stochastic scheduling are special cases of this model.

Note that deterministic scheduling models are based on predictive approaches that do not take
the presence of disturbances into account and that evaluate the scheduling solution in terms of estimated data. In
practice, however, this sort of schedule quickly becomes infeasible and yields poor performance, because
scheduling environments are usually subject to significant amounts of randomness.

As a result, it is not worthwhile to spend an enormous amount of time computing a supposedly optimal solution
when, within a few hours, random events will change the structure of the problem or the list of jobs [ ]. The
hypothesis of determinism in scheduling problems is therefore considered restrictive, and the problem of scheduling
with uncertainty management has been raised and is of interest to several researchers [3]. This has
motivated research into dynamic scheduling methods, which consist in (re)allocating resources at run time [128],
i.e. making decisions in real time given the state of resources and the progress of the different tasks over time¹. These
methods use approaches other than the predictive ones, known as proactive, reactive
and hybrid, the last of which includes two sub-types: predictive-reactive and proactive-reactive approaches. The proactive
approach computes a scheduling solution in advance by taking into account a priori knowledge about
probable uncertainties. The reactive approach, an on-line approach, builds real-time scheduling solutions
by taking into account any kind of uncertainty that may arise. Finally, it is possible to combine on-line (reactive)
and off-line (proactive) approaches in order to get the advantages offered by the two models [ ].

1.1 Definitions and notations in scheduling problems 

Jobs may have the following characteristics: preemptive or non-preemptive, resumable or non-resumable,
independent or linked. Linked jobs are usually represented by a precedence graph, a
directed acyclic graph that specifies the precedence constraints between task executions [136]. A job is called
non-preemptive if, once assigned to a machine, it is processed without interruption until its completion.
If, on the other hand, its processing may be interrupted before completion and reassigned to either the same
machine or some other machine, the job is called preemptive [ 0].

A job is said to be resumable if, after being interrupted by a machine non-availability period, it can be
resumed without being restarted once the machine becomes available again. A non-resumable job, by contrast,
has to be restarted every time it is interrupted [ ].
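
The difference between the two cases can be illustrated with a small Python sketch. It assumes a single machine with one non-availability period [u_start, u_end) and one job of processing time p that is ready at time start; the function name `completion_time` and its parameters are hypothetical names chosen for this illustration.

```python
def completion_time(start, p, u_start, u_end, resumable):
    """Completion time of one job on a machine that is unavailable
    during the interval [u_start, u_end)."""
    # The job is ready while the machine is down: it waits, then runs.
    if u_start <= start < u_end:
        return u_end + p
    # The job does not overlap the non-availability period at all.
    if start >= u_end or start + p <= u_start:
        return start + p
    if resumable:
        # Work done before the interruption is kept; only the
        # remainder is processed once the machine is back.
        done = u_start - start
        return u_end + (p - done)
    # Non-resumable: the job restarts from scratch after the period.
    return u_end + p

# Job of length 5 started at time 0, machine unavailable during [3, 6).
print(completion_time(0, 5, 3, 6, resumable=True))   # 8
print(completion_time(0, 5, 3, 6, resumable=False))  # 11
```

In this example the non-resumable job loses the three time units processed before the interruption, which is why it finishes three units later.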

Polynomial time [155]: The time complexity of an algorithm is said to be polynomial if the running
time of the algorithm is O(p(n)), where p(n) is a polynomial and n is the size of the input of the problem being
solved.

The setup time is defined in [20] as the time required to prepare the necessary resources, such as machines or people,
to perform a task or job operation. The setup cost is the cost of setting up any resource used prior to the execution of
a task (for more details see e.g. Allahverdi and Soroush [ ]).

In the remainder of this paper, we will use the following notation [ 27]: $C_i$, $F_i = C_i - r_i$, $L_i = C_i - d_i$, $w_i$,
$T_i = \max(0, L_i)$, $E_i = d_i - C_i$, $r_i$, $p_i$ and $d_i$ are respectively the completion time, flow time, lateness, relative weight,
tardiness, earliness, release date, processing time and due date of job $i$. Table I summarises the most commonly used
performance criteria for evaluating the quality of a scheduling solution.

¹ Groupe d'Ordonnancement Théorique et Appliqué (GOThA)


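TABLE I

THE PERFORMANCE MEASURES TO EVALUATE THE QUALITY OF A SCHEDULING SOLUTION

Completion time $C_i$
    Max:               $C_{\max} = \max_{1 \le i \le n} C_i$
    Total:             $C_{\mathrm{total}} = \sum_{i=1}^{n} C_i$
    Total weighted:    $wC_{\mathrm{total}} = \sum_{i=1}^{n} w_i C_i$
    Average:           $C_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} C_i$
    Average weighted:  $wC_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i C_i$

Flow time $F_i = C_i - r_i$
    Max:               $F_{\max} = \max_{1 \le i \le n} F_i$
    Total:             $F_{\mathrm{total}} = \sum_{i=1}^{n} F_i$
    Total weighted:    $wF_{\mathrm{total}} = \sum_{i=1}^{n} w_i F_i$
    Average:           $F_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} F_i$
    Average weighted:  $wF_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i F_i$

Lateness $L_i = C_i - d_i$
    Max:               $L_{\max} = \max_{1 \le i \le n} L_i$
    Total:             $L_{\mathrm{total}} = \sum_{i=1}^{n} L_i$
    Total weighted:    $wL_{\mathrm{total}} = \sum_{i=1}^{n} w_i L_i$
    Average:           $L_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} L_i$
    Average weighted:  $wL_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i L_i$

Tardiness $T_i = \max(0, L_i)$
    Max:               $T_{\max} = \max_{1 \le i \le n} T_i$
    Total:             $T_{\mathrm{total}} = \sum_{i=1}^{n} T_i$
    Total weighted:    $wT_{\mathrm{total}} = \sum_{i=1}^{n} w_i T_i$
    Average:           $T_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} T_i$
    Average weighted:  $wT_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i T_i$

Earliness $E_i = d_i - C_i$
    Max:               $E_{\max} = \max_{1 \le i \le n} E_i$
    Total:             $E_{\mathrm{total}} = \sum_{i=1}^{n} E_i$
    Total weighted:    $wE_{\mathrm{total}} = \sum_{i=1}^{n} w_i E_i$
    Average:           $E_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} E_i$
    Average weighted:  $wE_{\mathrm{average}} = \frac{1}{n} \sum_{i=1}^{n} w_i E_i$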
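
To make the measures of Table I concrete, the following Python sketch computes a few of them for a schedule whose completion times are already known. It is a minimal illustration: the `Job` class, the `measures` function and the sample data are hypothetical and not taken from the surveyed papers, and earliness follows the notation in the text (d_i - C_i, which can be negative).

```python
from dataclasses import dataclass

@dataclass
class Job:
    release: float      # r_i
    due: float          # d_i
    weight: float       # w_i
    completion: float   # C_i, taken from an already computed schedule

def measures(jobs):
    """Per-job values F_i, L_i, T_i, E_i and some aggregates of Table I."""
    n = len(jobs)
    C = [j.completion for j in jobs]
    F = [j.completion - j.release for j in jobs]   # flow time
    L = [j.completion - j.due for j in jobs]       # lateness
    T = [max(0, l) for l in L]                     # tardiness
    E = [j.due - j.completion for j in jobs]       # earliness
    w = [j.weight for j in jobs]
    return {
        "C_max": max(C),
        "C_total": sum(C),
        "wC_total": sum(wi * ci for wi, ci in zip(w, C)),
        "F_average": sum(F) / n,
        "L_max": max(L),
        "T_total": sum(T),
        "wT_average": sum(wi * ti for wi, ti in zip(w, T)) / n,
        "E_total": sum(E),
    }

jobs = [Job(0, 4, 1, 3), Job(1, 5, 2, 7), Job(2, 6, 1, 6)]
result = measures(jobs)
print(result)
```

The remaining entries of the table follow the same pattern: replace the per-job list and choose max, sum or mean, with or without the weights.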


The competitive ratio is a way to evaluate the performance of an online algorithm. The idea is to compare the
quality of an on-line algorithm with that of an algorithm that receives the complete information. The competitive
ratio is defined as [ 56]:

$$\max_{e \in E} \frac{\text{Cost of executing the plan that does not know } e \text{ in advance}}{\text{Cost of executing the plan that knows } e \text{ in advance}}$$
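
The definition can be tried out on a toy case. The sketch below is an assumption-laden illustration rather than a method from the cited work: the online plan is greedy list scheduling on two identical machines (each arriving job goes to the least loaded machine), the plan with full information is found by brute force, and the cost is the makespan. All function names and the three scenarios are hypothetical.

```python
from itertools import product

def competitive_ratio(scenarios, online_cost, offline_cost):
    """Worst ratio, over the given scenarios e, of the cost of the plan
    that does not know e in advance to the cost of the plan that does."""
    return max(online_cost(e) / offline_cost(e) for e in scenarios)

def list_scheduling_makespan(jobs, m=2):
    """Online plan: jobs are revealed one at a time and each is placed
    on the currently least loaded machine."""
    loads = [0] * m
    for p in jobs:
        loads[loads.index(min(loads))] += p
    return max(loads)

def optimal_makespan(jobs, m=2):
    """Offline plan: try every assignment (only viable for tiny inputs)."""
    best = float("inf")
    for assignment in product(range(m), repeat=len(jobs)):
        loads = [0] * m
        for p, machine in zip(jobs, assignment):
            loads[machine] += p
        best = min(best, max(loads))
    return best

scenarios = [(1, 1, 2), (2, 1, 1), (3, 3, 2, 2, 2)]
ratio = competitive_ratio(scenarios, list_scheduling_makespan, optimal_makespan)
print(ratio)  # 1.5, reached on the sequence (1, 1, 2)
```

Because the maximum is taken over a finite sample of scenarios, the value obtained this way is only a lower bound on the true competitive ratio, which ranges over every possible input.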

Scheduling problems may be solved using either meta-heuristic or heuristic algorithms [ 57], i.e.
iterative or constructive methods that may deliver approximate solutions within a reasonable time. The popular
ones are as follows:

• Genetic algorithms were initially developed to meet specific needs in biology. In the context of
combinatorial optimization applications, an analogy is drawn between an individual in a population
and a solution of a problem in the global solution space [157].

• The simulated annealing method takes its name from annealing in metallurgy, which is used to improve the
quality of a solid by seeking a state of minimal energy that corresponds to a stable structure of the solid.
The simulated annealing method is designed to avoid getting trapped in local minima [158].

• PSO (Particle Swarm Optimization) is a cooperative, population-based global search swarm
intelligence metaheuristic, presented by Kennedy and Eberhart in 1995 [33]. It is a powerful
optimization technique for solving multimodal continuous optimization problems [ ]. It is also a
population-based stochastic optimization technique which has become popular due to its effectiveness and
low computational cost.

• Longest Processing Time rule (LPT): jobs with large processing times are prioritized for
scheduling, so tasks are organized in descending order of their processing times. Shortest Processing
Time rule (SPT): jobs with small processing times are prioritized for scheduling, so tasks are
organized in ascending order of their processing times [ ].
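
The two dispatching rules at the end of this list are simple enough to sketch directly. The Python code below is an illustrative sketch with hypothetical function names and data: it orders jobs by SPT or LPT, evaluates the total completion time on a single machine, and evaluates the makespan when the ordered jobs are assigned greedily to the least loaded of m identical machines.

```python
def spt_order(p):
    """Shortest Processing Time first: job indices by ascending p[i]."""
    return sorted(range(len(p)), key=lambda i: p[i])

def lpt_order(p):
    """Longest Processing Time first: job indices by descending p[i]."""
    return sorted(range(len(p)), key=lambda i: -p[i])

def total_completion_time(p, order):
    """Sum of completion times on a single machine for a given order."""
    t, total = 0, 0
    for i in order:
        t += p[i]          # completion time of job i
        total += t
    return total

def greedy_makespan(p, order, m):
    """Assign jobs, in the given order, to the least loaded of m
    identical machines and return the resulting makespan."""
    loads = [0] * m
    for i in order:
        loads[loads.index(min(loads))] += p[i]
    return max(loads)

p = [4, 2, 7, 3, 5]
print(total_completion_time(p, spt_order(p)))   # 51
print(total_completion_time(p, lpt_order(p)))   # 75
print(greedy_makespan(p, lpt_order(p), m=2))    # 11
print(greedy_makespan(p, spt_order(p), m=2))    # 13
```

On this small instance SPT gives the smaller total completion time on one machine, while LPT gives the smaller makespan on two machines, which matches the usual roles of the two rules.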

Contingent schedule: “a contingent schedule allows different task resource assignments depending on how the 
execution of the schedule has proceeded so far. A contingent schedule can be viewed as a tree that assigns a 
possibly different task resource assignment for every possible execution [ ].” 

1.2 Context of the research topic 

The main objective of machine scheduling theory is to find the best solutions across a broad
range of application areas, such as production, medical, military, informatics services and telecommunications.
Different classifications of these problems have been adopted, each of which has been given, over
the years, different names among the following: static, deterministic, predictive, offline, stochastic, dynamic,
proactive, reactive, online, etc.

In the area of mobile networking, Cloud computing and Fog or Edge computing, which are the recent 
emerging domains of parallel machines scheduling, it is well justified to focus our study on the multi-machine 
model, dedicated machines and parallel machine model [ 7,1 ]. 

The parallel machine models are usually classified as follows: 

1. Identical parallel machines: all the machines are identical in terms of their speed. Every job will take the 
same amount of processing time on each of the machines.

2. Uniform (proportional) parallel machines: the machines have different speeds, and for each job, its
processing times on the machines are inversely proportional to the speeds of those machines.

3. Unrelated parallel machines: the machines have different speeds with arbitrary processing times of jobs. 
In this type of scheduling, there is no relation amongst the processing times of a job on the parallel 
machines. 
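
The three models differ only in how the processing time of job j on machine i is obtained, which the following Python sketch makes explicit. The function names and the sample numbers are hypothetical and chosen purely for illustration.

```python
def identical(p):
    """p[j] is the processing time of job j, the same on every machine."""
    return lambda j, i: p[j]

def uniform(p, speed):
    """Machine i runs at speed[i]; the time is inversely proportional to it."""
    return lambda j, i: p[j] / speed[i]

def unrelated(p_matrix):
    """p_matrix[i][j] is an arbitrary processing time of job j on machine i."""
    return lambda j, i: p_matrix[i][j]

p = [6, 4]                                   # two jobs
models = {
    "identical": identical(p),
    "uniform": uniform(p, speed=[1, 2]),     # machine 1 is twice as fast
    "unrelated": unrelated([[6, 4], [9, 1]]),
}
# One row per machine, one column per job.
tables = {name: [[t(j, i) for j in range(2)] for i in range(2)]
          for name, t in models.items()}
for name, table in tables.items():
    print(name, table)
```

In the uniform case each row is the base row divided by the machine speed, whereas in the unrelated case the rows have no relation to one another.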

Research on scheduling problems has expanded over decades. These studies differ according to the
perspective they adopt, since there is no common perspective for addressing this issue. In fact, many
surveys have been published over the last few years treating these problems from several points of view.

This paper is an attempt towards a comprehensive overview of dealing with scheduling problems based on a 
significant number of surveys and individual articles that have been published in the literature over the past 
decade. 

This survey focuses on scheduling problems with multiple parallel machines or devices. This is motivated
by the fact that scheduling, as mentioned above, has become, owing to accelerated
technological progress, an evident requirement in recently emerging fields such as modern factories, mobile
networks, cloud computing, fog computing, smart cities, etc.

The remainder of this paper is organized as follows. Sections 2, 3 and 4 present, respectively, recent surveys and
articles on scheduling problems in i) industry, ii) Cloud and Edge computing, and iii) mobile networks. These
sections are presented in two forms: the first details some important works, while the second summarizes other
surveys in a table illustrating their most important characteristics, such as reference, scheduling area, context,
scheduling environment, conclusion, suggestions and tendencies. Section 5 discusses the most
important ideas, suggestions and tendencies in these surveys. In Section 6 we conclude on the examined papers in
terms of the recent scheduling trends in emerging areas such as Cloud, Edge/Fog and Mobile Computing.

2 SURVEYS AND ARTICLES ON CLASSICAL AND INDUSTRIAL SCHEDULING PROBLEMS 

Many algorithms and surveys have been published in the area of scheduling problems. In this section,
we briefly analyse and discuss these works to understand the different approaches and algorithms used to solve
them.

Saidy et al. [ ] studied scheduling problems under various constraints such as activity duration, release dates,
due dates and precedence constraints, and the availability of resources (or machines) that could affect these
problems. In fact, these machines might not be available during certain time periods, called holes for
convenience [ ]. Their study is based on the classification of these problems into two classes: deterministic and
stochastic.

In this survey [ 1], Saidy et al. presented machine scheduling problems that have been studied in the literature
with availability constraints in the resumable, semi-resumable and non-resumable cases, within different
environments: single machine, parallel machines, flow shop, job shop, open shop, flexible flow shop and
flexible job shop. They mentioned papers dealing with single machine and also parallel machine problems
having resumable, non-resumable and “crossable” availability constraints. They define as “crossable” an
unavailability period that allows an operation to be interrupted and resumed after a period of time. Conversely, an
unavailability period that prevents the interruption of any operation, even if the operation is resumable, is defined as
“non-crossable”.

Saidy et al. stated that in stochastic models the following parameters are not known in advance: the
processing times, the release dates, the starting time and the duration of the unavailability period. However, they
assumed that the distributions of the processing times, due dates (deadlines), repair time and time at which
a breakdown occurs are known at time 0. The uptime and downtime of machines and the processing
requirements of jobs are assumed to be independent, identically distributed random variables.

They pointed out that heuristics are used for problems in both cases: when machines are continuously
available and when breakdowns occur. They considered that their survey may be a good
reference for those interested in sequencing and job scheduling issues in the context of limited resource
availability.

Most of the heuristics with error bound analysis have been gathered in this study [ 1], noting that these
heuristics, in some cases, produce optimal solutions. Also, the known polynomial (P) and pseudo-polynomial
(pseudo-P) models were summarized in a single table, whose results show that they are applicable to simpler
problems with equivalent performance measures. The authors concluded that "if availability constraints come
from unexpected breakdowns, fully online algorithms will be needed; but in case of preemptive scheduling,
many results of optimality concern the best nearly online algorithms. It is an open question to look for the
optimality results from fully on-line algorithms and specific availability patterns, or at least to compute
performance bounds." Finally, they stated that one direction is to assume that an operation cannot be interrupted
at any point in time, but only at given instants. Furthermore, they suggested that "authors work on more
complicated problems such as sequencing n jobs on m resources in the Flow Job or Open Job environment
[11]."

Chaari et al. [ 13 ] noted that in the real world several types of hard-to-predict risks must be considered in
scheduling problems, and that scheduling under uncertainty allows these kinds of risks to be taken into account.

They considered the numerical values (e.g., execution time, machine speed) used in
scheduling methods to be uncertain, incomplete or imprecise. Consequently, they focused on dynamic
scheduling more than on deterministic scheduling problems, because it corresponds to the real scheduling case.

Different types of approaches to scheduling under uncertainty exist in the literature, as well as
several dedicated typologies [ ]. Most of these typologies are based on the distinction between
scheduling approaches among the following: stochastic, fuzzy [ 151 ], proactive, reactive, or hybrid, which
includes two subtypes: predictive-reactive and proactive-reactive approaches.

They proposed a global classification scheme that is independent of any particular technique and encompasses new kinds of scheduling
algorithms under uncertainty [ ]:

1. Proactive (also known as robust) scheduling approaches: this kind of approach tries to anticipate
uncertainty while developing flexibility, in order to produce a schedule, or a family of schedules, that is
relatively insensitive to uncertainty [ ]. In proactive approaches, five different techniques can be
identified: techniques based on robustness measures, redundancy-based techniques, probabilistic
methods, contingent scheduling and optimization-based techniques.

2. Reactive scheduling approaches: these approaches are often used in highly perturbed real-time
environments, where off-line schedules rapidly become infeasible. In this context, decision-making
must be very fast and intuitively easy for users to understand [ 5 ]. The methods used to solve
reactive scheduling problems involve distributed (multi-agent systems) or centralized approaches,
priority rules and dynamic choice of priority rules. These approaches may exploit local priority criteria
when a decision must be made in real time.

3. Hybrid approaches: these approaches can be subdivided into predictive-reactive and proactive-reactive 
approaches. 

A. Predictive-reactive approaches: These approaches, used to cope with risks, have two scheduling
phases:

i) First phase: a deterministic schedule is set up off-line.

ii) Second phase: this schedule is used and adapted on-line. The on-line phase requires
making scheduling decisions one at a time while the schedule is being built. These
decisions are then adapted in real time to take disturbances into account [ ].

Scheduling methods can be constructed by answering two questions: “When to reschedule?” and 
“How to reschedule?” 

B. Proactive-reactive approaches: In the proactive-reactive approaches, in contrast to
predictive-reactive approaches, no rescheduling is done on-line; instead, one among several pre-estimated
schedule solutions is chosen. This makes it possible to build a set of static schedules
such that it is easy, in the event of risk, to pass from one to another.

“A new mixed technique presented by [ ] combines a proactive approach with a reactive
approach to deal with scheduling problem under uncertainty. In the proactive phase, the authors
built a robust baseline schedule that minimizes the schedules distance defined as the sum of the
absolute deviations between the baseline and expected schedules. The robust baseline schedule
contains some built-in flexibility in order to minimize the need of complex search procedures for
the reactive scheduling approach.”
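
The proactive-reactive idea of switching between schedules prepared off-line can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the technique of any cited paper: the selection criterion (smallest makespan among the schedules that avoid unavailable machines), the function names and the data are all assumptions, and jobs already in progress are ignored.

```python
def makespan(schedule, p):
    """schedule maps a machine name to the list of job indices it runs."""
    return max(sum(p[j] for j in jobs) for jobs in schedule.values())

def choose_schedule(precomputed, p, unavailable):
    """Among schedules computed off-line, pick the one with the smallest
    makespan that uses no unavailable machine; None if none qualifies."""
    feasible = [s for s in precomputed
                if not any(s.get(m) for m in unavailable)]
    return min(feasible, key=lambda s: makespan(s, p), default=None)

p = [4, 3, 2, 5]                              # processing times of 4 jobs
precomputed = [
    {"M1": [0, 2], "M2": [1], "M3": [3]},     # baseline: all machines up
    {"M1": [0, 1], "M2": [2, 3]},             # prepared for a failure of M3
    {"M1": [0, 2, 1], "M3": [3]},             # prepared for a failure of M2
]
print(choose_schedule(precomputed, p, unavailable=set()))
print(choose_schedule(precomputed, p, unavailable={"M3"}))
```

No rescheduling is computed on-line here: reacting to a disturbance only means selecting another member of the pre-built set.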

The authors presented an updated and enriched classification scheme for the various approaches to
scheduling in an uncertain environment, together with research prospects, of which the more important are as follows [ ].
The first concerns predicting uncertain events and measuring their impact on scheduling. The second concerns
combining multi-agent approaches with optimization techniques for dynamic scheduling and dynamic control. In this
prospect, the lack of optimality is due to decisions that result from the myopic behavior of decision-making entities [ ]. A
possible solution would be to increase the intelligence of these entities by introducing optimization techniques into
the decisional process. The third set of prospective research concerns the integration of new emerging
technologies into existing products, such as RFID (Radio-Frequency Identification), mechatronics and embedded
infotronics. This integration enables the products to participate in decision-making in a dynamic scheduling
context. The fourth set concerns the new possibilities for designing and dimensioning systems
based upon scheduling performance. This takes into account the agility perspective and the issues of evolution
and improvement in the production system's (re)design & (re)engineering process.

Kaabi et al. [ 7 ] focused on results related to the specific real industrial problem of parallel machine scheduling
under availability constraints. The authors present parallel machine problems under different constraints with
the appropriate solving algorithms, along with their complexities under various objective functions.
Emphasizing certain concepts, they pointed out the following:

• Parallel machine scheduling applies when several machines of a similar type, possibly with slightly different
characteristics, are available. Jobs can be processed on these machines
simultaneously.

• The machines may be subject to accidental breakdowns, periodic preventive maintenance (mainly
non-availability), tool changes, worker availability, availability of the resources used by the machines,
and so on.

• Two classes of problems are considered, depending on whether the scheduling of preventive
maintenance activities is determined before the scheduling of jobs or jointly with it.
Class 1: production scheduling and maintenance “planning” are planned and executed separately,
as is generally the case in real manufacturing systems. Class 2:
maintenance and production services collaborate in order to maximize system productivity.
Maintenance strategies can be broadly classified into Corrective Maintenance (CM) and Preventive
Maintenance (PM).

• Job scheduling under maintenance constraints has generally been applied to single-machine and
multi-machine models. In this paper, the authors deal only with scheduling parallel machines subject to
availability constraints.

The authors use several tables: their Table I lists the most often used notations in scheduling, their Table II the main results
on single-machine scheduling under availability constraints, and their Table III the main results on scheduling identical
parallel machines under availability constraints [ 7 ]. Below we present two tables
summarising the results extracted from this survey concerning scheduling problems on uniform
and unrelated parallel machines under availability constraints.


TABLE II

SCHEDULING PROBLEMS RESULTS OVER UNIFORM PARALLEL MACHINES

Measure to be optimized: $Q_{1,2} \mid online \mid C_{\max}$ (denotes the problem of online scheduling on
two uniform parallel machines, where one machine is periodically unavailable, to minimize the makespan)

Solving algorithm: Optimal algorithm (as declared by the authors)

Improvement of complexity or competitive ratio:

    Ref. [129]: Online scheduling is investigated on m parallel machines and on two
    uniform parallel machines where one machine is periodically
    unavailable. In the latter case the length of each available period is normalized
    to 1 while the speed of the other one is s > 0. If s > 1 (s being the speed of the 2nd
    machine), the proposed algorithm is optimal with a competitive ratio of 1 + 1/s. In
    the case where 0 < s < 1, the authors proposed some lower bounds on the
    competitive ratio.

    Ref. [130]: If s = 1 and jobs arrive in decreasing sequence, the proposed algorithm is proved to be
    optimal with a competitive ratio of 3/2.


TABLE III

SCHEDULING PROBLEMS RESULTS OVER UNRELATED PARALLEL MACHINES

Measure to be optimized: Total machine load on m unrelated parallel machines with maintenance
activity (ma): Rm |P ;n ,ma < h Jk | $\sum_{i=1}^{m} C_i$
Solving algorithm: Two efficient algorithms
Improvement of complexity or competitive ratio: Complexity $O(n^{m+3})$
Ref.: [133]

Measure to be optimized: Total machine load: Rm | Pj n = Pij + abji,ma < h jk | $\sum_{i=1}^{m} C_i$.
Minimizing the total completion time or the total machine load on m unrelated parallel machines
Solving algorithm: An algorithm that considers a deterioration of maintenance activities
Improvement of complexity or competitive ratio: Complexity $O(n^{m+3})$
Ref.: [13 ], [132]

Measure to be optimized: Minimizing the total completion time on m unrelated parallel machines
Solving algorithm: An algorithm that simultaneously considers deterioration effects
and deteriorating multi-maintenance activities
Improvement of complexity or competitive ratio: Jointly finds the optimal maintenance
frequencies, the optimal maintenance positions, and the optimal job sequences
with a polynomial time algorithm
Ref.: [134]




The authors concluded that “the maintenance activities can be planned in a flexible or in a non-deterministic 
ways. In fact, the machines are subject to random breakdowns.” In addition, the assumption of non resumable jobs 
needs to be taken into account for many real life problems. “Considering more realistic constraints such as online 
scheduling, resumable jobs, and nondeterministic availability constitute interesting research directions [7].” 

The survey paper [ ] of Allahverdi provides an extensive review of about 500 papers that appeared
from mid-2006 to the end of 2014, covering static, dynamic, deterministic, and stochastic environments.
The survey classifies scheduling problems based on shop environments as single machine, parallel
machine, flow shop, job shop, or open shop. It further classifies the problems as family and non-family as well
as sequence-dependent and sequence-independent setup times/costs. The focus of that paper is on the setup
times/costs [20] factor, which is ignored by most of the existing scheduling literature. Allahverdi drew up several tables
that list the references together with the criteria or measures to be optimized and the approaches used.
These tables are organized by shop environment (single machine, parallel machine, etc.) and according
to the classification of problems as family and non-family as well as sequence-dependent and
sequence-independent setup times/costs. The author summarized the results as follows:

• Heuristic solution methods have been used about twice as often as exact methods.

• Among the heuristics, the genetic algorithm has been the most used, followed by Simulated
Annealing (SA).

• Among the exact solution methods, Mixed Integer Programming (MIP) and Branch and Bound
(B&B) are the most used.

He concluded that there is a need for: 

• More research on scheduling problems with explicit consideration of setup times/costs. 

• Considering family setup time for the parallel and job shop environments. 

• Addressing the sequence-dependent scheduling problems in single machine environments with family 
setup times. 

• Addressing more scheduling problems with multiple criteria. 

• Addressing more scheduling problems with uncertain setup times. 

In [ ] the dynamic job shop scheduling problem that considers random job arrivals and machine breakdowns was 
studied. “Considering an event driven policy rescheduling, is triggered in response to dynamic events by 
variable neighborhood search (VNS). A trained artificial neural network (ANN) updates parameters of VNS at 
any rescheduling point. Also, a multi-objective performance measure is applied as objective function that 
consists of makespan and tardiness. The proposed method is compared with some common dispatching rules 
that have been widely used in the literature for dynamic job shop scheduling problem.” 

The paper [19] studies the static scheduling problem of m identical parallel machines 
with a common server and sequence-dependent setup times. To the best of the authors' knowledge, it is the first 
such study in the literature. The authors focused on comparing the performance of the proposed Mixed Integer 
Linear Programming (MILP) model and of Simulated Annealing (SA) and Genetic Algorithm (GA) based solution 
approaches with the performance of basic dispatching rules, such as shortest processing time first (SPT) and 
longest processing time first (LPT), over a set of randomly generated problem instances. 

The MILP model is presented together with the SPT and LPT dispatching rules for the problem of minimizing the 
makespan. According to the authors, however, the MILP model cannot solve large-scale problem instances 
because of the NP-hard nature of the problem. For this reason, simulated annealing (SA) and genetic algorithm 
(GA) based solution approaches are proposed for solving the large-scale problem instances. Based 
on the computational experiments, the proposed GA is generally the most efficient and effective approach 
(followed by the SA approach) in solving this problem. As the problem size increases, the GA approach finds 
better solutions with smaller standard deviations. Genetic algorithms belong to the family of stochastic 
optimization algorithms and can also be used for machine learning. 
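
The SPT and LPT dispatching rules mentioned in this discussion can be illustrated on a simplified version of the problem. The sketch below is not the authors' MILP, SA or GA approach: it assumes identical parallel machines with no common server and no setup times, and the function name `list_schedule` and the job data are hypothetical.

```python
import heapq


def list_schedule(processing_times, machines, rule):
    """Order jobs by a dispatching rule, then give each job to the earliest-free machine."""
    if rule == "SPT":
        order = sorted(processing_times)                # shortest processing time first
    elif rule == "LPT":
        order = sorted(processing_times, reverse=True)  # longest processing time first
    else:
        raise ValueError("rule must be 'SPT' or 'LPT'")
    loads = [0] * machines          # current finishing time of every machine
    heapq.heapify(loads)
    for p in order:
        earliest = heapq.heappop(loads)       # machine that becomes free first
        heapq.heappush(loads, earliest + p)
    return max(loads)                          # makespan


jobs = [2, 3, 4, 6, 2, 2]   # hypothetical processing times
spt_makespan = list_schedule(jobs, 2, "SPT")
lpt_makespan = list_schedule(jobs, 2, "LPT")
print(spt_makespan, lpt_makespan)
```

On this small instance LPT reaches a makespan of 10 while SPT reaches 11, which is the kind of baseline against which metaheuristics such as GA and SA are usually compared.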

Kia et al. studied in [ ] a dynamic flexible flow line problem with sequence-dependent setup times for 
minimizing the mean flow time and mean tardiness. By applying a genetic programming framework and choosing 
proper operators, they proposed four new composite dispatching rules to solve this NP-hard problem. To examine 
the performance of the scheduling rules, a discrete-event simulation model was built that considers the four new 
heuristic rules and six heuristic rules adapted from the literature. 

In [ ], Rintanen considered that “the contingent approach recognizes that different schedules are needed 
under different contingencies, and computes them either off-line before the execution phase or on-line as 
information about the contingencies becomes available. This is the most general approach, eliminating the 
limitations (incompleteness, sub-optimality) of the other approaches at the cost of increased complexity.” He 
investigated the properties of some classes of contingent scheduling problems. In these problems, assignments of 
resources to tasks depend on resource availability and other facts that are only fully known during execution. 
Therefore, the off-line construction of one fixed schedule is insufficient. He demonstrated that, in general, 
contingent scheduling is most likely outside the complexity class NP. “Their results prove that standard 
constraint-satisfaction and SAT (Boolean satisfiability) frameworks are in general not straightforwardly 
applicable to contingent scheduling.” 

Skutella et al. address in [ ] two main characteristics generally encountered in real-world scheduling 
problems: heterogeneous processors and a certain degree of uncertainty about the sizes of jobs. They 
studied, for the first time to the best of their knowledge, a scheduling problem that combines the classical 
unrelated machine scheduling model with stochastic processing times of jobs. 

For the stochastic version of the unrelated parallel machine scheduling problem with the weighted sum 
of completion times objective, $R \mid (r_{ij}) \mid E[\sum_j w_j C_j]$, they calculated in polynomial time a 
scheduling policy with a performance guarantee of $(3+\Delta)/2 + \varepsilon$ using a novel time-indexed linear 
programming relaxation. They showed that when jobs also have individual release dates, their bound is 
$(2+\Delta) + \varepsilon$, where $\Delta$ is an upper bound on the squared coefficient of variation of the 
processing times and $\varepsilon > 0$ is arbitrarily small. They showed that the dependence of the performance 
guarantees on $\Delta$ is tight. 
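
To make the role of the squared coefficient of variation concrete, the following minimal sketch evaluates the two guarantee expressions stated above for a given bound (called `delta` in the code). The function names and the example distributions are hypothetical, and the arbitrarily small additive term is left out.

```python
def squared_cv(values, probabilities):
    """Squared coefficient of variation Var[P] / E[P]^2 of a discrete distribution."""
    mean = sum(v * p for v, p in zip(values, probabilities))
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, probabilities))
    return variance / mean ** 2


def guarantee_without_release_dates(delta):
    # (3 + delta) / 2, with the arbitrarily small additive term omitted
    return (3 + delta) / 2


def guarantee_with_release_dates(delta):
    # 2 + delta, with the arbitrarily small additive term omitted
    return 2 + delta


deterministic = squared_cv([5], [1.0])          # no variance at all
two_point = squared_cv([1, 3], [0.5, 0.5])      # processing time is 1 or 3 with equal probability
for delta in (deterministic, two_point):
    print(delta, guarantee_without_release_dates(delta), guarantee_with_release_dates(delta))
```

With zero variance the two expressions reduce to 3/2 and 2, and they grow linearly as the processing times become more variable.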

On the deterministic side, the best currently known approximation algorithms for unrelated parallel machines 
have performance guarantees of 3/2 and 2 for the problem without and with release dates, respectively 
[121, F 25]. Improving these bounds is considered one of the most important open problems in scheduling (see 
Schuurman and Woeginger [ ]). 

On the stochastic side, the authors stated that they consider for the first time the stochastic variant of 
unrelated parallel machine scheduling. In stochastic scheduling, the task is to compute a non-anticipatory 
scheduling policy, which must make its decisions at any given time based on the observed past up to that time 
as well as the a priori knowledge of the input data of the problem. Here, the processing time of a job j on 
machine i is given by a random variable $P_{ij}$. The authors assume that the random variables $P_{ij}$ are 
stochastically independent across jobs. For any given non-anticipatory scheduling policy, the possible outcome 
of the objective function $\sum_j w_j C_j$ is a random variable. The goal is then to minimize its expected value, 
which by linearity of expectation equals $\sum_j w_j E[C_j]$. 

The authors mentioned that, for the first time, they completely departed from the linear programming relaxation 
of Möhring et al. [ ] and showed how to put a novel, time-indexed linear programming relaxation to work in 
stochastic machine scheduling. According to the authors, this approach will inspire further research on other 
stochastic optimization problems in scheduling and related areas. In addition, they showed how to overcome the 
difficulty that scheduling policies feature a considerably richer structure, including complex dependencies 
between the executions of different jobs, which cannot easily be described by time-indexed variables. As a 
result, they presented the first time-indexed LP relaxation for stochastic scheduling on unrelated parallel 
machines. Here, the probability of a job j being started on machine i at time t is represented by the 
time-indexed variable $x_{ijt}$. The situation is complicated in the stochastic context, and it 
requires a fair amount of information on the exact probability distributions of the random variables. Some other 
surveys on classical and industrial scheduling problems are summarized in Table IV. 


TABLE IV 

PROPERTIES EXTRACTED FROM SOME ADDITIONAL SURVEYS OF CLASSICAL AND INDUSTRIAL SCHEDULING PROBLEMS 

| Authors / Reference | Scheduling problems context | Scheduling problems environment | Conclusion and suggestions |
|---|---|---|---|
| Samia Ourari / [ ] | Deterministic to distributed scheduling approaches based on cooperation | Scheduling approaches under uncertainties: reactive, proactive and proactive-reactive approaches. They differ according to how uncertainty is taken into account, either offline or on-line; single machine & job shop | It is important to address the problem of managing uncertainty in scheduling in order to reduce the gap between the theory (estimated or expected scheduling) and the practical field (scheduling really implemented or adopted). |
| Janiak et al. / [105] | Offline & online scheduling | Single machine & parallel machine scheduling, Just-in-Time (JIT) scheduling models, PERT/CPM (Program Evaluation and Review Technique / Critical Path Method) scheduling | A currently noticeable trend in this area is the concept of combining pure due window scheduling problems with other new and trendy phenomena, e.g., learning or aging effects, deteriorating jobs, maintenance activities, etc. These models call for practical rather than theoretical development; their future trends include the analysis of scheduling problems with due windows (multiple due windows) and preemptive jobs or precedence constraints. |


3 SCHEDULING PROBLEMS IN CLOUD COMPUTING 

Masdari et al [32] addressed scheduling problems in the Cloud environment, which, like other environments, 
still lacks adequate and effective solutions for load balancing and for scheduling tasks and workflows. The tasks 
or jobs are mapped to the appropriate Virtual Machines (VMs), which are generated virtually from a single 
physical machine, to optimize some given scheduling measures. Various heuristic, metaheuristic and exact 
algorithms are applied to study the Cloud scheduling problem. In this paper, the authors present an in-depth 
analysis of the Particle Swarm Optimization (PSO)-based task and workflow scheduling schemes proposed in 
the literature for the cloud environment. Moreover, they provide a classification of the proposed scheduling 
schemes based on the type of PSO algorithm, highlighting their objectives, properties and limitations. In the 
Particle Swarm Optimization algorithm, the swarm of particles is initially generated at random; each particle 
position in the search space represents a possible solution and has a fitness value, and each particle has a 
velocity that determines the speed and direction of its moves. By moving and updating position and velocity, 
particles reach an optimized solution [ 5, 37]. The authors claim that, following the repeated advances (called 
iterations), the particle swarm gradually approaches the optimal location [ ]. 
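
The PSO mechanics described in this paragraph (random initial swarm, fitness per position, velocity and position updates over iterations) can be sketched as follows. This is a generic, minimal PSO on a continuous test function, not any of the surveyed scheduling schemes; the function name `pso_minimize`, the inertia and acceleration constants, and the sphere test function are all assumptions made for illustration.

```python
import random


def pso_minimize(fitness, dim, bounds, particles=30, iterations=200, seed=0):
    """Minimal particle swarm optimizer with an inertia weight (illustrative only)."""
    rng = random.Random(seed)
    low, high = bounds
    inertia, cognitive, social = 0.7, 1.5, 1.5   # assumed textbook-style constants
    # Random initial positions and velocities.
    pos = [[rng.uniform(low, high) for _ in range(dim)] for _ in range(particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(particles)]
    best_pos = [p[:] for p in pos]               # personal best position of each particle
    best_val = [fitness(p) for p in pos]
    g = min(range(particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]   # best position found by the whole swarm
    initial_best = g_val
    for _ in range(iterations):
        for i in range(particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + cognitive * r1 * (best_pos[i][d] - pos[i][d])
                             + social * r2 * (g_pos[d] - pos[i][d]))
                # Move the particle and keep it inside the search bounds.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            val = fitness(pos[i])
            if val < best_val[i]:
                best_val[i], best_pos[i] = val, pos[i][:]
                if val < g_val:
                    g_val, g_pos = val, pos[i][:]
    return g_pos, g_val, initial_best


def sphere(x):
    """Hypothetical test function with its minimum (0) at the origin."""
    return sum(v * v for v in x)


best_position, best_value, initial_best = pso_minimize(sphere, dim=2, bounds=(-10.0, 10.0))
print(best_position, best_value)
```

In a scheduling setting the position vector would have to be decoded into a task-to-VM mapping and the fitness replaced by a cost or makespan measure; that encoding differs between the surveyed schemes and is not shown here.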

The authors present the papers in the literature that propose PSO algorithms under different schemes: 

• Standard PSO schemes: Guo et al [ ], Zhang et al [ ], Yang et al [ 38 ], Huang et al [ 55], 
Pandey et al. [ ], Wu et al [ ] and Jianfang et al [ ]. 

• Multi-Objective PSO schemes (footnote 2): Netjinda et al [ 13], Wang et al [13 ], Ramezani et al [ ] 
and Yassa et al [ ]. 

• Bi-Objective PSO schemes (footnote 3): Beegom et al [ 8] and Verma et al [5' ]. 

• Hybrid PSO schemes (footnote 4): Zhan et al [ 6], Kaur et al [ ], Visalakshi et al [52], 
Krishnasamy and Gomathi et al [ ], Xue et al [ ], JieHui et al [ ] and Xiaoguang et al [ ]. 

• Learning PSO schemes, which use learning PSO for scheduling in the cloud environment: 
Zuo et al [ ] and Chen et al [ ]. 

• Jumping PSO schemes, proposed to optimize the load balancing, the speed-up ratio, and the 
makespan: Chitra et al [ ]. 

• Modified PSO schemes, which improve PSO to overcome the drawbacks of the standard PSO or to 
increase its performance: Tarek et al [ 3], Zhao et al [ > 1 ], Pragaladan et al [58] and Abdi 
et al [62]. 

The properties of these PSO-based scheduling schemes are presented in five tables in that survey. Based on 
these tables, Table V summarizes, in the form of a report, how the schemes studied address these 
properties. 


TABLE V 

PROPERTIES REPORT OF SCHEDULING SCHEMES STUDIED IN [32] 

Rate of the scheduling schemes studied in this survey that deal with each property. 

| PSO scheduling scheme | References of scheduling schemes studied | Objective: minimizing cost | Objective: minimizing task execution time | Objective: QoS support | Objective: minimizing makespan | Scheduling type: task | Scheduling type: workflow |
|---|---|---|---|---|---|---|---|
| Standard | [ 39 . 351 | 6/7 (footnote 5) | | 2/7 | 4/7 | 3/7 | 4/7 |
| Multi-Objective | [43, 39] | 4/4 | 2/4 | 2/4 | 1/4 | 3/4 | 1/4 |
| Bi-Objective | [48, 50] | 2/2 | 1/2 | - | 1/2 | 1/2 | 1/2 |
| Hybrid | [46, , 51] | 1/7 | 5/7 | 1/7 | 1/7 | 6/7 | 1/7 |
| Learning | [47, 45] | - | - | 2/2 | - | 1/2 | 1/2 |
| Jumping | [59] | - | - | - | 1/1 | - | 1/1 |
| Modified | [ ] | 3/4 | 2/4 | - | 1/4 | 3/4 | 1/4 |


The authors stated that metaheuristic algorithms yield better results, in terms of solution quality, than 
deterministic algorithms (which always produce the same output for a given input). Likewise, in terms of 
computation time, they find approximate solutions faster than traditional exhaustive algorithms [138]. 

Finally, the authors concluded that scheduling is a critical process for mapping cloud tasks to VMs and for 
reducing their execution cost and time. Moreover, future research on the scheduling problem should consider, 
investigate, evaluate and enhance various security-related factors in task and workflow scheduling solutions. 
It should also address heterogeneous resource functionality and load balancing on VMs (virtual machines) and 
data center networks to reduce their energy consumption. Scheduling tasks and workflows on hybrid and 
federated clouds should also be studied further. 


2 This is the process of simultaneously optimizing two or more conflicting objectives subject to a number of constraints. 

3 This is the process which solves optimization models for two objective functions respectively. 

4 To overcome some of the limitations of the PSO, one or more algorithms such as Genetic Algorithm, Ant Colony, etc., are 
integrated with the PSO algorithm. 

5 e.g., 6 scheduling schemes out of a total of 7 dealt with “Minimizing cost”. 

Ramathilagam and Vijayalakshmi [ ] noted that no typical, effective task scheduling algorithm is 
employed in the Cloud environment. Furthermore, well-known task planners are very difficult to 
implement in a large-scale distributed environment because of the high communication cost. This calls for 
compatible and applicable job scheduling algorithms and load balancing techniques for such large-scale 
environments. “To balance the load in cloud the resources and workloads must be scheduled in an efficient 
manner. A variety of scheduling algorithms are used by load balancers to determine which backend server to 
send a request to [15 ].” Cloud computing employs the latest technology, and its use is increasing 
considerably. Job scheduling is one of the processes carried out to enhance the functioning of the cloud 
computing environment efficiently while achieving maximum profit. In this paper, the authors concisely 
investigated and discussed several scheduling algorithms and issues in cloud computing. These algorithms fall 
into two groups, static and dynamic, each with its own merits and demerits. 

The authors of [63] also surveyed various types of task scheduling algorithms in Cloud computing. These 
algorithms are listed and compared in Table 1 in Section 5 of their survey. They stated that heuristic-based 
algorithms, which belong to a subset of the meta-heuristic approach, are one of the important means of 
achieving an optimal or near-optimal solution to task scheduling in the cloud environment. The task scheduling 
techniques employed in the cloud environment are classified into the following three categories: 

1. Traditional techniques, which are simple and deterministic but get stuck in local optima [67]: First 
Come First Serve (FCFS), Round Robin (RR), Shortest Job First (SJF), etc. 

2. Heuristic techniques, which are used to find an optimal or near-optimal solution by using a sample 
space of random solutions: min-min, max-min, enhanced max-min [66], priority-based min-min, 
etc. These techniques give better results than the traditional approaches [6< ]. 

3. Meta-heuristic techniques, which make use of a random solution space for task scheduling. The principal 
difference between heuristics and meta-heuristics is that the former are problem specific while the latter 
are problem independent [ ]. Meta-heuristics generally have functional similarities with aspects of 
the life sciences (biology): (a) meta-heuristics based on gene transfer: Genetic algorithms and 
Transgenic Algorithm; (b) meta-heuristics based on interactions among individual insects: Ant Colony 
Optimization, Firefly algorithm, Marriage in Honey Bees Optimization algorithm, Artificial Bee Colony 
algorithm; and (c) meta-heuristics based on biological aspects of living beings: Tabu Search algorithm, 
Simulated Annealing algorithm, Particle Swarm Optimization algorithm and Artificial Immune System 
[153]. 
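
The traditional techniques in the first category can be compared with a few lines of code. The sketch below is a simplified illustration, assuming a single resource, jobs that all arrive at time zero, and no context-switch overhead; the burst times and the quantum are hypothetical.

```python
from collections import deque


def average_waiting_time(burst_times):
    """Average waiting time when jobs run to completion in the given order."""
    waited, clock = 0, 0
    for burst in burst_times:
        waited += clock      # this job waited until all earlier jobs finished
        clock += burst
    return waited / len(burst_times)


def fcfs(burst_times):
    """First Come First Serve: keep the arrival order."""
    return average_waiting_time(burst_times)


def sjf(burst_times):
    """Shortest Job First: run the shortest jobs first."""
    return average_waiting_time(sorted(burst_times))


def round_robin(burst_times, quantum):
    """Round Robin: each job runs for at most one quantum per turn."""
    remaining = list(burst_times)
    queue = deque(range(len(burst_times)))
    clock, total_wait = 0, 0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])
        clock += run
        remaining[job] -= run
        if remaining[job] > 0:
            queue.append(job)                         # not finished, back of the queue
        else:
            total_wait += clock - burst_times[job]    # waiting = completion - burst
    return total_wait / len(burst_times)


bursts = [6, 8, 7, 3]   # hypothetical burst times
print(fcfs(bursts), sjf(bursts), round_robin(bursts, quantum=4))
```

For these burst times the average waiting times are 10.25 (FCFS), 7.0 (SJF) and 13.25 (RR with a quantum of 4), which shows how strongly the outcome of a simple deterministic rule depends on the job mix.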

Based on this extensive survey, the authors concluded that many problems and issues remain open, such as 
the need for scheduling techniques that cover all requirements accurately. Another vital issue in scheduling 
algorithms is energy efficiency (energy consumption, energy savings, energy sufficiency, etc.). 
For multiple workflows, metrics such as reliability and availability should also be considered. 

Sharma et al describe in [ ] the work done in the field of task scheduling algorithms. These algorithms are 
classified as follows: 

1. Efficient Task Scheduling Algorithm: Sindhu et al [65] proposed an enhanced task scheduling 
algorithm to minimize the completion time of cloudlets. Their approach comprises two algorithms, named 
Longest Cloudlet Fastest Processing Element (LCFP) and Shortest Cloudlet Fastest Processing Element 
(SCFP). 

2. Improved Min-Min Algorithm: Kaur et al [68] proposed an improved min-min algorithm to achieve 
maximum resource utilization in a distributed environment. This algorithm consists of two phases: 
the first is similar to the traditional min-min algorithm, in which the minimum completion time of each 
task is calculated. In the second phase, tasks are rescheduled so as to select those resources 
which have been unutilized for a long period of time. 

3. Enhanced Max-Min Algorithm: In order to optimize task scheduling in the cloud computing 
environment, Santosh et al [73] proposed an Enhanced Max-Min Algorithm that consists of two 
algorithms using, respectively, the arithmetic and the geometric mean to calculate the average job 
execution time instead of the maximum completion time. The job whose completion time is 
just greater than the calculated average time is then selected. If the jobs are independent of each other, 
the arithmetic mean gives the best average time; otherwise, the geometric mean gives the 
best average time. 

4. Selective Algorithm: A selective algorithm has been proposed by Kobra Etminani et al [ ] for 
ensuring the QoS. This algorithm uses the advantages of the two basic scheduling algorithms, min-min 
and max-min, and tries to overcome their disadvantages. The selective parameter is the standard 
deviation of the completion time of the unassigned tasks in the meta-task. 

5. Optimized Task Scheduling Algorithm: To improve scalability in the cloud environment, Shubham 
Mittal et al [ 2] proposed an optimized task scheduling algorithm. The authors took into account five 
algorithms (min-min, max-min, RASA, improved max-min and enhanced max-min). 

6. Improved Task Scheduling Algorithm: Abdul Razaque et al [ T] proposed an improved task 
scheduling algorithm to achieve proper utilization of the network bandwidth in the cloud computing 
environment. They use a non-linear programming model for assigning a proper number of tasks to each 
virtual machine. 

7. Cloud-based Workflow Scheduling Algorithm: Bhaskar Prasad Rimal et al [75] proposed a Cloud-based 
Workflow Scheduling Algorithm in order to enhance the workflow in a multi-tenant cloud 
environment. The authors defined the workflow as a new service layer, the fourth one, on 
top of the infrastructure. They used a Directed Acyclic Graph (DAG) to represent the workflow in the 
systems. The labels on the nodes and the edges represent the costs of computation and of 
communication, respectively. This algorithm uses the idle time of resources, reduces the makespan, 
properly utilizes resources and minimizes the cost. 

8. Task Scheduling Algorithm based on Quality of Service (QoS): A Task Scheduling - Quality of 
Service (TS-QoS) algorithm has been proposed by Xiaonian Wu et al [ 16] for optimizing the service 
quality in cloud computing. The algorithm first computes the priority of each task on the basis of 
certain parameters and then sorts the entire list of tasks according to their priority. The task with the 
minimum completion time is considered the highest-priority task and gets the resource first for job 
completion. Three indexes are taken into account for measuring the performance of this algorithm: i) the 
makespan, ii) the average waiting time of the longest task (Average Latency), iii) the load balancing 
index (LBI), used to determine the loading conditions of the system and the maximum system loading 
capacity [ l]. 

9. Task Scheduling Algorithms with Multiple Factors: Nidhi Bansal et al [ ] compared 
traditional scheduling methods, i.e. FCFS, with optimization methods, QoS-driven, ABC (Activity 
Based Costing) and priority-based algorithms, etc., using CloudSim as a simulator. The authors 
showed that resource utilization and cost are the main criteria that any scheduling algorithm has to 
handle well. They also stated that optimization-based methods performed better than the 
traditional methods. 
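
Several of the algorithms in the list above build on the basic min-min and max-min heuristics. The following minimal sketch shows only those two baselines, not any of the improved variants; the function name `greedy_schedule` and the expected-time-to-compute matrix `etc` are hypothetical.

```python
def greedy_schedule(etc, largest_first):
    """Min-min (largest_first=False) or max-min (largest_first=True) heuristic.

    etc[t][m] is the expected execution time of task t on machine m.
    Returns the task-to-machine assignment and the resulting makespan.
    """
    ready = [0] * len(etc[0])            # time at which each machine becomes free
    unassigned = set(range(len(etc)))
    assignment = {}
    while unassigned:
        # For every unassigned task, find its minimum completion time and the machine giving it.
        best = {}
        for t in unassigned:
            m = min(range(len(ready)), key=lambda k: ready[k] + etc[t][k])
            best[t] = (ready[m] + etc[t][m], m)
        # Min-min picks the task with the smallest of these minima, max-min the largest.
        chooser = max if largest_first else min
        task = chooser(sorted(unassigned), key=lambda t: best[t][0])
        finish, machine = best[task]
        assignment[task] = machine
        ready[machine] = finish
        unassigned.remove(task)
    return assignment, max(ready)


# Hypothetical instance: four tasks, two machines, one task much longer than the others.
etc = [[2, 4], [3, 6], [4, 7], [10, 14]]
_, min_min_makespan = greedy_schedule(etc, largest_first=False)
_, max_min_makespan = greedy_schedule(etc, largest_first=True)
print(min_min_makespan, max_min_makespan)
```

On this instance min-min leaves the long task for last and ends with a makespan of 15, whereas max-min schedules it first and reaches 13. This imbalance is the weakness that selective and enhanced variants try to address.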

As a result of their study, the authors concluded that no heuristic approach can satisfy all 
the required parameters. However, these approaches can perform better when only some particular parameters, 
such as resource utilization, execution time for each task, or workflow, are considered at a time. 

Abbas and Zhang [83] study Mobile Edge Computing (MEC), an emerging architecture that 
extends cloud computing services to the edge of networks by leveraging mobile base stations. It can be applied 
to mobile, wireless and wireline scenarios, using software and hardware platforms located at the 
network edge in the vicinity of end-users. 

Mobile networks suffer from low storage and energy capacity, low bandwidth, and high latency [8 ]. 
Moreover, the exponential growth of the emerging Internet-of-Things (IoT) technology is expected to further 
strain cellular and wireless networks [85]. Edge computing (Fog computing) [ ] has become of 
paramount significance, especially Mobile Edge Computing (MEC) in mobile cellular networks. MEC is 
equipped with better offloading techniques that characterize the network with low latency and high bandwidth. 

The main contribution of this paper is a survey of MEC. A few MEC survey reports, such as [8 ] and [88], 
exist in the literature. The paper provides a brief overview of the different attributes of MEC and identifies the 
major open research challenges in MEC. In addition, it presents an extensive survey on mobile edge computing 
focusing on its general overview. 

Subsequently, several research efforts were recently carried out in the area of MEC. They are classified 
according to different domains: 

1. Offload computation: This is a way to improve the capacity of mobile devices by transferring 
computation to more resourceful servers located elsewhere [92]. Improvements in 
mobile devices and networks will still not be able to cope with the increased demand on these devices. 
As a result, mobile devices will always have to compromise with their limited resources, such as 
resource-poor hardware, insecure connections, and energy computing tasks [89]. 

In 2015, several algorithms and prototypes were proposed: i) the Edge Accelerated web Browsing 
(EAB) prototype proposed by Takahashi et al [91], designed for web applications using a better 
offloading technique; ii) an algorithm-based design, called Successive Convex Approximation (SCA), 
proposed by Sardellitti et al [90], which optimizes computational offloading across multiple 
densely deployed radio access points; iii) the FemtoCloud system proposed by Habak et al [93], which 
forms a cloud of orchestrated co-located mobile devices that are self-configurable into a correlative 
mobile cloud system. 

In 2016, other algorithms and prototypes were proposed: a) the efficient computation offloading model 
designed by Chen et al [94] using a game theoretic approach in a distributed manner. Game theory is a 
persuasive tool that helps simultaneously connected users make the correct decision when 
connecting to a wireless channel based on strategic interactions; b) the contract-based computation 
resource allocation scheme proposed by Zhang et al [9 ], which improves the utility of vehicular terminals 
that intelligently utilize services offered by MEC service providers under low computational 
conditions. 

2. Low Latency: MEC is equipped with better offloading techniques that characterize the network with 
low latency and high bandwidth. It is therefore one of the promising edge technologies that can improve 
user experience. 

In 2015, Nunna et al [ ] proposed a real-time context-aware collaboration system by combining MEC 
with 5G networks. This integration of MEC and 5G helps to empower real-time collaboration systems 
utilizing context-aware application platforms. These systems require context information combined 
with geographical information and low-latency communication. 

In 2016, two schemes were proposed: i) REPLISOM, designed by Abdelwahab et al [96], is an edge 
cloud architecture and an LTE (Long Term Evolution) enhanced memory replication protocol to avoid 
latency issues. The LTE bottleneck occurs when allocating memory to a large number of IoT devices in the 
backend cloud servers; ii) Kumar et al [ ] proposed a vehicular delay-tolerant network-based smart 
grid data management scheme. The authors investigated the use of Vehicular Delay-Tolerant Networks 
(VDTNs) to transmit data to multiple smart grid devices exploring the MEC environment. 

3. Storage: To overcome their device storage limitation, end-users may utilize MEC resources. 

In 2016, Jararweh et al [9 ] proposed a framework, the Software Defined system for Mobile Edge 
Computing (SDMEC), that connects software-defined system components to MEC to further extend 
MCC (Mobile Cloud Computing) capabilities. The components work together cohesively to enhance MCC 
into the MEC services. 

4. Energy Efficiency: The MEC architecture is designed to reduce the energy consumption of user devices by 
migrating compute-intensive tasks to the network edge. 

Many schemes have been developed in this field: 

In 2014, an opportunistic peer-to-peer mobile cloud computing framework was proposed by Wei Gao 
[103]. The probabilistic framework is composed of peer mobile devices connected via their short-range 
radios. Based on their available capacity, these mobile devices are able to share energy and 
computational resources. The author proposed a probabilistic method to estimate the opportunistic 
transmission status of the network, ensuring that the resulting computation is delivered to its 
initiator in a timely manner. 

In 2015: i) an architecture that integrates MEC with voice over LTE, called ME-VoLTE, was proposed by 
Beck et al [ 0 ]. The encoding of video calls is offloaded to the MEC server located at the base station 
(eNodeB). Offloading video encoding to external services helps extend the battery lifetime 
of the user equipment, since encoding is highly compute-intensive and hence very power-consuming; 
ii) El-Barbary et al [100] proposed DroidCloudlet, an architecture to enhance mobile battery lifetime 
by migrating data-intensive and compute-intensive tasks to rich-media. Based on commodity mobile 
devices, DroidCloudlet is legitimized with resource-rich mobile devices that take the load of 
resource-constrained mobile devices. 

In 2016, Jalali et al [ ) 2 ] proposed a flow-based and time-based energy 
consumption model. They conducted a number of experiments on efficient energy consumption using 
centralized nano Data Centers (nDCs) in a Cloud computing environment. However, the authors claim that 
the energy consumption of nDCs has not yet been investigated. 
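
The offloading and energy-efficiency items above share one trade-off: a task is worth offloading only if uploading its data and running it at the edge costs less than running it locally. The sketch below illustrates that decision with a deliberately simple cost model that is an assumption, not a model taken from the surveyed papers; all class names, parameters and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    cycles: float      # CPU cycles the task needs
    data_bits: float   # input data that must be uploaded if the task is offloaded


@dataclass
class Device:
    cpu_hz: float            # local CPU speed
    joules_per_cycle: float  # local energy spent per CPU cycle
    tx_watts: float          # radio transmit power
    uplink_bps: float        # uplink rate to the edge server


def local_cost(task, device):
    """Return (seconds, joules) for running the task on the device itself."""
    return task.cycles / device.cpu_hz, task.cycles * device.joules_per_cycle


def offload_cost(task, device, edge_hz):
    """Return (seconds, joules spent by the device) for running the task at the edge."""
    tx_time = task.data_bits / device.uplink_bps
    return tx_time + task.cycles / edge_hz, device.tx_watts * tx_time


def should_offload(task, device, edge_hz, time_weight=0.5):
    """Offload when the weighted sum of time and device energy is lower at the edge."""
    def weighted(cost):
        seconds, joules = cost
        return time_weight * seconds + (1 - time_weight) * joules
    return weighted(offload_cost(task, device, edge_hz)) < weighted(local_cost(task, device))


phone = Device(cpu_hz=1e9, joules_per_cycle=1e-9, tx_watts=0.5, uplink_bps=16e6)
heavy = Task(cycles=1e9, data_bits=8e6)    # compute-intensive, little data
chatty = Task(cycles=1e8, data_bits=8e7)   # light computation, a lot of data
print(should_offload(heavy, phone, edge_hz=10e9), should_offload(chatty, phone, edge_hz=10e9))
```

Under these assumed numbers the compute-intensive task is offloaded while the data-heavy one stays on the device, because its upload time dominates. Downlink time and edge-side queuing are ignored in this sketch.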

The authors [83] concluded that, because MEC is a recent technology platform, little research has been devoted 
specifically to it. In fact, there are some open issues in MEC that need to be addressed. Many researchers 
interested in MEC have studied problems related to these issues, which belong to several areas including: 
Resource Optimization, Transparent Application Migration, Web Interface, Security, Pricing, Network Openness, 
Multi-services and Operations, Robustness and Resilience. 

We summarise, in Table VI, significant properties extracted from some other surveys of scheduling problems 
in the Cloud environment: context, environment, conclusion and suggestions. 


TABLE VI 

ADDITIONAL SURVEYS OF SCHEDULING PROBLEMS IN CLOUD ENVIRONMENT 


Authors / 

TReference 

Scheduling problems 
Context 

Scheduling problems environment 

Conclusion and Suggestions 

S. Yi et al / 
[106] 

Static, Dynamic, Real 
Time and Heuristic 
Scheduling 

The main issues: 

Fog networking (is heterogeneous): Internet 
of Things, software-defined networking, 
network function virtualization (NFV) to 
create flexible and easy maintaining network 
environment. 

Quality of Service (QoS): connectivity, 
reliability, capacity, and delay. 

Fog computing will evolve with the rapid 
development in underlying IoT, edge devices, 
radio access techniques, SDN, NFV, VM and 
Mobile cloud. We think fog computing is 
promising but currently need joint efforts from 
underlying techniques to converge to "fog 
computing". 
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Interfacing and programming model. 

Computation Offloading. 
Accounting, billing and monitoring. 
Provisioning and resource management: 
Application-aware provisioning, Resource 
discovery and sharing. 

Security and Privacy 


Deshmane et 
al / [ 107 ]. 

Static, Dynamic, Real 
Time and Heuristic 
Scheduling 

Enhanced Max-min, Improved Genetic, 
Scalable Heterogeneous Earliest-Finish- 
Time (SHEFT) Improved Cost-Based, 
Resource-Aware-Scheduling, Innovative 
transaction intensive cost-constraint 
scheduling, Algorithms and Multiple QoS 
Constrained Scheduling Strategy of Multi- 
Workflows (MQMW) 

There is a need to implement a new scheduling 
algorithm to minimize the execution time and 
improve availability and reliability in a cloud 
computing environment. The improvement can 
also be done with building algorithms that take 
user preferences while scheduling. Also, one 
more aspect can help improving the design of 
algorithm, which can include new factors such 
as inter-node bandwidth etc., that have not 
been considered for resources matching. 

Ahmed et al. 

/ [ 88 ] 

Static, Dynamic, Real 
Time and Scheduling 

The contributions of this article are as 
follows: (a) survey of the state of-the-art 
research, (b) Preview taxonomy based on 
various parameters such as characteristics, 
actors, access technologies, applications, 
objectives, computational platforms, and key 
enablers, (c) Identification of various open 
challenges related to the Mobile Edge 
Computing that impede or prevent the 
successful deployment. 

The open research challenges in Mobile Edge 
Computing: Standard Protocol, Simulation 
Platform, Mobility Management, 
Heterogeneity, Pricing Model, Scalability and 
Security 

Singh et al. / 
[ 109 ] 

Dynamic / static 
scheduling and 
allocation of resources 

Resource scheduling algorithms (RSA) and 
dynamic RSAs, Bargaining Based RSA, 
Compromised Cost and Time based RSA, 
Cost Based RSA, Dynamic and Adaptive 
Based RSA, Energy Based RSA, Hybrid 
Based RSA, Nature Inspired and Bio- 
Inspired Based RSA, Optimization Based 
RSA, Profit Based RSA, Priority Based 
RSA, SLA and QoS Based RSA, Time 
Based RSA and VM Based RSA 

Resource scheduling aspects and resource 
distribution policies 

Recent research has shown that resource 
scheduling algorithms using resource 
provisioning mechanisms and applying the 
effective resource provisioning technique. 

On the basis of existing research, it is 
necessary to fully understand QoS 
requirements for workload for better allocation 
of resources rather than to detect workload and 
resources. It is necessary to find the progress in 
the search on the cloud itself before finding the 
advanced search in the scheduling of resources. 

Manpreet Kaur / 
[ 110 ] 

Static & Dynamic 
scheduling 

Min-Min, Max-Min, RASA, Shortest Job 
First, heuristic, Dominant Resource Priority 
and multi-objective task Scheduling 
Algorithms 

Future work would be to continue the 
multi-objective scheduling improvement. The 
authors have done the non-dominated sorting 
of virtual machines (VMs) according to MIPS. 

In future, they aim to take other parameters of 
VMs into account when sorting them for better performance. 


4 SCHEDULING PROBLEMS IN MOBILE NETWORKS 


Many surveys have been published recently with regard to scheduling problems in mobile networks. In this 
section, we address these surveys briefly. 

Mahidhar et al. [ 21 ] addressed the packet scheduling problem in wireless sensor networks (WSNs) by providing a 
dynamic scheme. A WSN is a highly distributed network of small, lightweight nodes whose main constraint is 
limited battery life. Sensor nodes spend their energy in transmitting and receiving data, as well 
as in relaying packets. This calls for routing algorithms that maximize the lifetime of the 
network. Packet scheduling is important in WSNs to maintain fairness based on data priority and to reduce 
end-to-end delay. The authors proposed the Dynamic Multilevel Priority (DMP) Packet Scheduling Scheme with 
bit rate classification, in which the data is divided into three categories: high, moderate and low bit rate. They also 
proposed a threshold value check mechanism to prevent deadlock situations. To provide security, they 
implemented the RC6 encryption algorithm. 

Another important aspect of real-time WSN data transmission is packet scheduling at the sensor nodes, 
which ensures the delivery of different packets according to their priority and fairness with minimal delay. This 
saves battery power by reducing the sensors' working time. 

The authors also presented various existing real time scheduling schemes which are as follows [ ]: 

1. Dynamic Conflict-free Query Scheduling (DCQS): a novel query-based scheduling technique 
designed to support in-network data aggregation; it can dynamically adapt the transmission 
schedule in response to workload changes [ 23 , 2 ]. 
2. Nearest Job Next (NJN): the setting consists of a mobile element (ME), a server and clients. A client is a node 
that requests service. NJN is a simple and intuitive discipline adopted by the ME to select 
the next request, and hence the next client, to be served [22]. 

3. Traffic Pattern Oblivious Scheduling (TPO): efficiently handles a wide variety of traffic patterns by 
using a single TDMA (Time-Division Multiple Access) schedule [27]. 

4. Dynamic Multilevel Priority Packet Scheduling (DMPPS): it consists of three levels of the priority 
queues, the data is placed in the priority queue based on the priority, the last level of the virtual hierarchy 
does not have the priority queue and the levels are formed based on the hop distance from the base 
station [25]. 

5. First Come First Serve (FCFS): it is the simplest packet scheduling algorithm in which packets are 
processed as they come [24]. 
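
As a rough illustration of the Nearest Job Next discipline in item 2, the following Python sketch has a mobile element repeatedly serve the pending client closest to its current position. The planar coordinates, the Euclidean distance metric, the tie-breaking rule and the function name `nearest_job_next` are assumptions made for illustration; they are not taken from [22].

```python
import math


def nearest_job_next(start, requests):
    """Return the order in which a mobile element (ME) serves client requests.

    `start` is the ME's initial (x, y) position and `requests` maps a
    hypothetical client id to that client's (x, y) position. At each step
    the ME moves to the closest pending client (ties broken by client id).
    """
    position = start
    pending = dict(requests)
    order = []
    while pending:
        client = min(
            pending,
            key=lambda c: (math.dist(position, pending[c]), c),
        )
        order.append(client)
        position = pending.pop(client)  # the ME is now located at that client
    return order


if __name__ == "__main__":
    clients = {"a": (5, 0), "b": (1, 0), "c": (2, 0)}
    print(nearest_job_next((0, 0), clients))
```

By contrast, FCFS (item 5) would serve the clients in arrival order regardless of where the ME currently is.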

By comparing these schemes, the authors concluded that DMPPS is the best one. They presented a literature 
review table for five articles [ 29,31,28,3( ]. This table shows the objectives, key issues, and advantages, which 
are summarized as: reducing average energy consumption, balancing the nodes' energy consumption, 
minimizing the delay at nodes and increasing network lifetime. They also illustrated the Adaptive Staggered Sleep 
Protocol (ASLEEP), which is efficient for power management in wireless sensor networks. This 
protocol dynamically adjusts the node sleep schedules to match the network demand, so that each node adjusts its 
active period dynamically [ 137]. 

The scheduling scheme proposed in [ ] has three levels of priority queues; the last level of the virtual 
hierarchy does not have a priority queue. Data packets are classified as i) real-time data, given 
priority 1, ii) non-real-time remote data, received from lower-level nodes, given priority 2, and iii) non-real-time 
local data, sensed by the node itself, given priority 3. A TDMA scheme is used to process the data 
packets sensed by nodes at the different levels. The conclusion drawn from this paper is that one of 
the advantages of the DMP scheme is its adaptability to the changing requirements of Wireless Sensor 
Network applications. The proposed threshold value check mechanism, applied when 
data arrives at the high-priority queue, helps to reduce deadlock situations. 
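
The three-way packet classification described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the packet fields `real_time` and `remote` and the class name `ThreeLevelScheduler` are hypothetical, packets are served first-come-first-served within a level, and the threshold check and TDMA slotting of the DMP scheme are not modelled because the text does not specify them.

```python
from collections import deque

# Priority levels as described in the text (1 is served first).
REAL_TIME, NON_RT_REMOTE, NON_RT_LOCAL = 1, 2, 3


def classify(packet):
    """Map a packet (a dict with hypothetical boolean fields) to a priority."""
    if packet["real_time"]:
        return REAL_TIME
    return NON_RT_REMOTE if packet["remote"] else NON_RT_LOCAL


class ThreeLevelScheduler:
    """One FIFO queue per priority level; always serve the highest level first."""

    def __init__(self):
        self.queues = {
            level: deque() for level in (REAL_TIME, NON_RT_REMOTE, NON_RT_LOCAL)
        }

    def enqueue(self, packet):
        self.queues[classify(packet)].append(packet)

    def dequeue(self):
        """Return the next packet to transmit, or None if all queues are empty."""
        for level in (REAL_TIME, NON_RT_REMOTE, NON_RT_LOCAL):
            if self.queues[level]:
                return self.queues[level].popleft()
        return None


if __name__ == "__main__":
    scheduler = ThreeLevelScheduler()
    scheduler.enqueue({"id": "local", "real_time": False, "remote": False})
    scheduler.enqueue({"id": "remote", "real_time": False, "remote": True})
    scheduler.enqueue({"id": "rt", "real_time": True, "remote": False})
    while (pkt := scheduler.dequeue()) is not None:
        print(pkt["id"])
```

A strict-priority design like this one can starve the lower levels under sustained real-time load, which is one reason a scheme of this kind needs an additional safeguard.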

In his survey [ 8 ], Nimbalkar introduces Opportunistic Scheduling for effective load balancing in multipath 
traffic networks. This technique aims to maximize throughput and packet delivery ratio by exploiting 
short-term variations in path conditions. 

Opportunistic Scheduling works to achieve two objectives simultaneously: selecting the user with the best 
channel conditions and satisfying fairness constraints over long-term scales. Thus, many algorithms have been 
developed, as in [ ), 80, 81, 82], which exploit high-quality channels to achieve fair use of the multiple 
channels available. 
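
One common way to balance these two objectives is the proportional-fair rule, which in each time slot serves the user whose current achievable rate is largest relative to its own average throughput so far. The Python sketch below illustrates that general rule only; it is not the algorithm of any specific cited paper, and the averaging window, the initial average value and the function name `proportional_fair` are assumptions.

```python
def proportional_fair(rates, window=10.0):
    """Pick one user per time slot with the proportional-fair rule.

    `rates[t][u]` is the achievable rate of user `u` in slot `t`
    (hypothetical input data). Returns the list of chosen user indices.
    """
    n_users = len(rates[0])
    # Small positive starting average (an assumption) avoids division by zero.
    avg = [1e-6] * n_users
    schedule = []
    for slot in rates:
        # Favour a good channel, but relative to what the user already received.
        chosen = max(range(n_users), key=lambda u: slot[u] / avg[u])
        schedule.append(chosen)
        # Exponentially weighted update of every user's average throughput.
        for u in range(n_users):
            served = slot[u] if u == chosen else 0.0
            avg[u] = (1 - 1 / window) * avg[u] + (1 / window) * served
    return schedule


if __name__ == "__main__":
    # User 0 always has the better channel, yet user 1 still gets served.
    print(proportional_fair([[10.0, 1.0], [10.0, 1.0], [10.0, 1.0]]))
```

A purely greedy scheduler would always pick the user with the highest instantaneous rate; dividing by the running average is what introduces the long-term fairness.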

An Opportunistic Scheduler takes into account criteria such as maximum delay, minimum throughput, 
maximum response time and maximum latency in order to impose fairness on the channels. A good scheduling 
algorithm should therefore exhibit characteristics such as maximum resource utilization, 
maximum throughput, minimum turnaround time, minimum waiting time, and minimum response time. 

Opportunistic Scheduling techniques have been studied and classified into five categories. The following 
table (Table VII) summarizes their characteristics. 


TABLE VII 

CHARACTERISTICS OF OPPORTUNISTIC SCHEDULING FOR TRAFFIC NETWORKS WITH MULTIPLE CONSTRAINTS [ 78 ] 


Category 

Approach 

Optimized measures 

Channels / 
scheduling 
Categories 

Performance 

Resolution settings 

Ref. 

Fairness 

Proportional-fair 
sharing approach 

Total throughput 

Two 

competing 

channels 

Maximize total 
throughput 

The channel rate 
history 

[ 140 ] 

Multichannel Fair 

Scheduler 

Optimal throughput 

Multiple 
wireless 
channels / 
Deterministic 

and 

Probabilistic 

Maximize throughput 

This model uses 
adaptive control 
framework to 
develop 

opportunistic fair 
wireless schedulers. 

[ 141 ] 

Indexed to optimal 
solution of 
throughput 

QoS and throughput 

Multiple user 
QoS 

Optimal solution of 
throughput 


[ 79 ] 

Heuristic 
opportunistic 
scheduling policy 

QoS, throughput and 
short-term fairness 

Multiple 

interface 

system 

Throughput 
performance of the 
heuristic policy is 
comparable to that of 
the long-term optimal 
policy 

Heuristic 
opportunistic 
scheduling policy 
This model uses 
Lyapunov 
optimization 
framework for 
stochastic network 
optimization 

Stronger delay & 
efficient throughput 
utility 

Stochastic 

Opportunistic 
scheduling that 
guarantees a bounded 
worst-case delay 

Network which has 
time-varying 
channel condition 

[ 80 ] 

Delay 

Using radio resource 
allocation 
in OFDMA 

Delay sensitive traffic, 
system capacity and 
throughput fairness 


This model provides 
fairness with respect to 
the realizable 
throughput per user, 
packet dropping ratios 
and packet delay 
distributions 

OFDMA 
(Orthogonal 
frequency-division 
multiple access) 
based network 

[ 81 ] 


An opportunistic 
scheduling based on 
multi-user diversity 
effect 

Total throughput 


Total system 
throughput is 
maximized 

This scheduling 
mechanism can 
result in higher 
spectrum utilization 

[ 82 ] 


A time slotted 
system where time 
is the resource to be 
shared among all 

users 

Maximize the system 
performance 
stochastically 

Multiple 
channels / 
Stochastic 

Users scheduling, to 
transmit slots at each 
time, that optimize the 
network performance. 


[ 108 ] 


Model uses a time- 
slotted system where 
time is the resource 

to be shared 

Throughput value 

Multiple 
channels / 

Stochastic 

The performance of a 
user's channel 
condition by enlarging 
the stochastic process 
value. 


[ 142 ] 

Quality of 
Service 

A model whereby 
the scheduling 
mechanism is based 
on preventing from 
transmitting under 
adverse conditions. 

Maximize user utility 
measure, e.g., 

communication rate 
for efficient utilization 
of the available 

communication 

resources. 

Shared 

wireless 

channels 

Communication rate 

Resources of a 

wireless channel 

network 

[ 143 ] 


Reinforcement 
learning framework 
to design distributed 
adaptive 
opportunistic 
routing problem (d- 
AdaptOR) 

Minimizing the 
average per packet 
cost for routing a 
packet from source to 
destination 

Wireless 

multi-hop 

network 

Packet routing from 
source to destination 

Distributed adaptive 
opportunistic 
routing problem (d- 
AdaptOR) 

[ 144 ] 


Low complexity 
adaptive scheduling 
algorithms 

An identical 
throughput guarantee 

Time-slotted 

networks 

An approximate 
throughput guarantee 

This model develops 
an expression for the 
approximate 
throughput 
guarantee violation 
probability for users 
in time-slotted 

networks 

[ 145 ] 

Throughput 

A simple algorithm 
for networks with 
short-lived flows 

Throughput optimal 

Wireless 

networks with 

short-lived 

flows 

Performance of the 

channel flows 

transmission 

Wireless channel 

network 

[ 146 ] 


Opportunistic 
Multipath 
Scheduling (OMS) 


Multipath 
routing uses 
multiple 
alternative 
paths in the 
network. 

OMS minimize the 
delay and improves 
overall throughput 

Multiple network 
paths 

[ 147 ] 


A new model of 
opportunistic 
scheduling 
mechanism 

Maximizing the 
system overall 
throughput 

Wireless 
network with 
hybrid links 

To avoid starvation of 
the link having a much 
lower transmission 

rate 

Wireless channel 
network 

[ 148 ] 


A model of 
Distributed 
Opportunistic 
Scheduling (DOS) 
under average delay 
constraint 

Maximize the overall 
throughput 
or the throughput of 
every link according to 
its own individual 
delay constraint 

Ad-hoc 
network of 
wireless 
channels 

Optimize the 
throughput 
performance in 
Ad-hoc network 

Homogeneous/heterogeneous 
scenarios with 
saturated/non-saturated 
stations 

[ 149 ] 





Distributed 
Opportunistic 
Scheduling (DOS) 

Maximize the 
throughput in a 
network of channels 

with a certain access 
probability 

Wireless 

network 

Improve the 
throughput 
performance of the 
wireless network 

Homogeneous/heterogeneous 
scenarios with 
saturated/non-saturated 
stations 

[ 150 ] 


In [ 2 ], the authors write "we present a centralized integrated approach for: 1) enhancing the performance of 
an IEEE 802.11 infrastructure wireless local area network (WLAN), and 2) managing the access link that 
connects the WLAN to the Internet. Our approach, which is implemented on a standard Linux platform, and 
which we call ADvanced Wi-fi Internet Service EnhanceR (ADWISER), is an extension of our previous system 
WLAN Manager (WM). ADWISER addresses several infrastructure WLAN performance anomalies such as 
mixed-rate inefficiency, unfair medium sharing between simultaneous TCP uploads and downloads, and 
inefficient utilization of the Internet access bandwidth when Internet transfers compete with LAN-WLAN 
transfers, etc. The approach is via centralized queueing and scheduling, using a novel, configurable, cascaded 
packet queueing and scheduling architecture, with an adaptive service rate." 

With the objective of managing inter-cell interference with a centralized controller, Ramos-Cantor et al. 
addressed in [ 16 ] the problem of coordinating scheduling decisions among multiple base stations in an 
LTE-Advanced downlink network. To solve the coordinated scheduling problem, an integer non-linear 
program is formulated. It makes use only of the specific measurement reports defined in the 3GPP standard and, 
unlike most existing approaches, does not rely on exact channel state information. The authors proposed an 
equivalent integer linear programming reformulation of the coordinated scheduling problem, which can be 
solved efficiently by commercial solvers. The performance of the proposed coordinated scheduling approaches 
is analyzed by extensive simulations of medium to large size networks. The available analytical results show 
fundamental limits in cooperation due to interference outside the cluster. 

A centralized scheduling approach was proposed in [ ] to handle centralized coordination among 
heterogeneous agents. In this study, the center agent acts as an information collector, processor and resource 
scheduler. The main contribution of this center agent is to make centralized scheduling run well. A clustering analysis 
based on an artificial immune algorithm is applied to process information; moreover, a series of schemes are 
suggested to ensure smooth scheduling. 

In [ 3 ], it is indicated that datacenter-scale computing for analytical workloads is becoming increasingly 
popular. Due to high operational costs, heterogeneous applications are forced to share cluster resources to 
achieve economies of scale. Existing approaches to scheduling large and diverse workloads tackle the problem in two 
alternative ways: (1) centralized solutions offer strict, secure enforcement of scheduling invariants (fairness, capacity) 
for heterogeneous applications; and (2) distributed solutions offer scalable, efficient scheduling for 
homogeneous applications. The authors proposed Mercury, a hybrid resource management framework that 
supports the full spectrum of scheduling, from centralized to distributed. Mercury exposes a programmatic 
interface that allows applications to trade off between scheduling overhead and execution guarantees. The 
authors stated that their framework harnesses this flexibility by opportunistically utilizing resources to improve 
task throughput. Experimental results on production-derived workloads show gains of over 35% in task 
throughput. These benefits can be translated by appropriate application and framework policies into job 
throughput or job latency improvements. They have implemented and contributed Mercury as an extension of 
Apache Hadoop / YARN. 

Fu et al. introduced En-Omega in [ L1 ]. This is a novel hierarchical hybrid scheduler design that addresses the 
serious job starvation problem that arises especially in heavily loaded clusters. In En-Omega, the fully 
distributed schedulers are enhanced with a central scheduler. This can provide global fairness to the jobs 
from different schedulers and simultaneously reduce the average latency of all jobs sharply. To reduce 
overhead, the central scheduler in the En-Omega design is activated only when the cluster is heavily loaded. 
Furthermore, the cache used for central queuing and the scoring policy used in central scheduling are both 
load-aware. En-Omega was evaluated on a Google trace, and experimental results show that, compared to the 
baseline design, this method can reduce the average latency of starving jobs by up to 90% with reasonable 
overhead. 

In [ , 8 ], the authors pointed out that traditional distributed wireless video scheduling is based on perfect 
control channels where instantaneous control information from the neighbors is available. They noted that 
this information is difficult to obtain in practice, especially in dynamic wireless networks. They found that the 
two existing approaches, distortion-minimum scheduling, which aims to meet long-term video quality demands, and 
minimum-delay scheduling, cannot be applied directly. They therefore investigated distributed 
wireless video scheduling with delayed control information (DCI). First, they translate this scheduling problem 
into a stochastic optimization problem, rather than a convex optimization problem, in order to work within a tractable 
framework. Next, they consider two classes of DCI distributions to study the relationship between the DCI and 
scheduling performance, and provide a general performance bound for any distributed scheduling scheme. 
These classes are: i) the class with finite mean and variance, and ii) a general class that does not employ any 
parametric representation. Thereafter, a class of distributed scheduling schemes is created to achieve the 
performance bound by making use of the correlation among the time-scale control information. The main 
contributions are presented at the theoretical and technical levels. 

1. Theoretical level: an appropriate Lyapunov function based on the observed DCI is presented to establish 
the scheduling performance in terms of DCI. This leads to the positive Harris recurrence property of 
the network Markov process. This is the most challenging part of the work, since it requires 
proving an effective time scale separation between the network state dynamics and the scheduling decision 
dynamics. To make this possible, they design an increasing function of the queue size to capture the 
correlation of the DCI. 

2. Technical level: a distributed video scheduling scheme in terms of DCI is proposed. This scheme 
utilizes only local queue-length information to make scheduling decisions, and it requires each node 
to perform only a few logical operations at each scheduling decision. 

The authors concluded that they provided a general performance bound for any distributed 
scheduling scheme. Importantly, they designed a class of distributed online scheduling schemes that achieve the optimal 
performance bound by making use of the correlation among the time-scale control information. 

Table VIII summarizes the important characteristics of some other surveys of scheduling problems in mobile 
networks: context, environment, and conclusion and suggestions. 

TABLE VIII 


PROPERTIES EXTRACTED FROM SOME ADDITIONAL SURVEYS OF MOBILE NETWORKS SCHEDULING PROBLEMS 


Authors / 

Reference 

Area 

Context 

Scheduling problems environment 

Conclusion and Suggestions 

Akashdeep et al. / 
[ 104 ] 

Mobile networks 

Point-to-Multipoint (PMP) and mesh mode 
for wireless broadband access in networks. 

Scheduling techniques for IEEE 
802.16 networks in PMP mode and 
their approaches, which may be 
divided into subcategories such as 
Traditional, Hierarchical, Cross-Layer 
Approaches, Dynamic Schedulers and 
Soft Computing based. 

Some areas are still not well explored, 
namely the application of soft 
computing/optimization techniques such as genetic 
algorithms, neural networks, fuzzy logic, etc. 
Using these approaches together with 
information from higher layers can act as a major 
contributor in the field of scheduling. 


5 DISCUSSION 


From the recent surveys presented in this paper, and from older ones, we conclude that the 
scheduling problem has evolved from balancing tasks on one machine, to a few machines, and then to a 
very large number of machines. This evolution took place over decades, from heavy manufacturing to various 
areas of light industry, and more recently to networks of mobile devices and to Cloud, Fog and Edge 
computing. 

Many different algorithms have been designed by researchers or adopted from other domains, especially 
mathematics, and then developed to solve scheduling problems. In this paper, we investigate their 
advantages in terms of efficiency and speed in achieving the desired optimal scheduling policy. 

Based on the published work, we find that the criterion to meet is to accomplish the desired job efficiently in 
the shortest possible time, taking into account the capacity of the equipment used in terms of energy and processing. 
Researchers naturally prefer an exact optimal policy as the approach to solving scheduling 
problems. However, if this policy has high complexity and requires more effort, energy and time, researchers aim 
to find another policy that is close to it but simpler and more economical. This is the case for small mobile devices in 
networks that support the latest technologies, for example Cloud and Fog Computing, and especially for 
communication systems of ubiquitous resources, which face several challenges that still cannot be accommodated, such 
as high energy consumption, the continuously increasing demand for high data rates, and the mobility required 
by new wireless applications. Since these devices suffer from limited energy, storage and 
computing capacity, the trend is toward the use of heuristic and meta-heuristic methods. 
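
The trade-off between exact and heuristic policies can be made concrete with a small example. The Python sketch below compares an exhaustive search for the minimum makespan on identical machines with the Longest Processing Time first (LPT) greedy heuristic. LPT is chosen here only as one standard representative of heuristic scheduling; the function names and the task durations are illustrative assumptions, not taken from the surveyed papers.

```python
from itertools import product


def makespan(assignment, durations, machines):
    """Finish time of the busiest machine for a task-to-machine assignment."""
    loads = [0] * machines
    for task, machine in enumerate(assignment):
        loads[machine] += durations[task]
    return max(loads)


def exact_makespan(durations, machines):
    """Exact optimum by trying all machines**len(durations) assignments."""
    return min(
        makespan(assignment, durations, machines)
        for assignment in product(range(machines), repeat=len(durations))
    )


def lpt_makespan(durations, machines):
    """LPT heuristic: longest task first, always onto the least-loaded machine."""
    loads = [0] * machines
    for duration in sorted(durations, reverse=True):
        loads[loads.index(min(loads))] += duration
    return max(loads)


if __name__ == "__main__":
    tasks = [3, 3, 2, 2, 2]  # hypothetical task durations
    print("exact:", exact_makespan(tasks, 2))
    print("LPT:  ", lpt_makespan(tasks, 2))
```

The exhaustive search grows exponentially with the number of tasks, whereas LPT needs only a sort and one pass over the tasks, at the price of a possibly slightly longer schedule.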

We note the success achieved through the use of Cloud computing in solving communication 
problems as well as the storage and transfer of data, followed by the use of Fog and Edge computing to solve the 
problems experienced by Cloud computing. However, we find that these means become ineffective in the case of 




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 








International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


heavy and excessive communications, or in the case of natural disasters or terrorist attacks that may destroy their service 
centers. Therefore, we believe that the solution in these cases is a network of ubiquitous resources and mobile 
devices that can be established anytime and anywhere. Such a network can also be used in normal situations because of 
its very low cost and its permanent presence with us. 

6 CONCLUSION 

Scheduling and load balancing problems are the most important issues in sharing tasks between machines or 
devices. The most recent systems, such as the Internet of Things and Cloud, Fog and Edge computing, whose main issue is 
task sharing, are evolving rapidly because of technological advances and heavy demand for their use. 

In this paper, we presented a collection of recent surveys and articles concerned with scheduling 
problems. We presented different scheduling issues and investigated their advantages and disadvantages based 
on many criteria, such as context, environment, optimizing function, algorithms used, suggestions and proposed 
improvements. The result is a document that summarizes the most recent research in this field in order 
to present a comprehensive view of scheduling problems. 

The main objective of this effort is to facilitate the work of researchers and readers who investigate this topic, 
as well as those working in this field. It also aims to draw attention to, and thus take advantage of, newly 
emerging areas that are in dire need of scheduling, such as Cloud, Fog and Edge computing, and especially 
networks of ubiquitous resources and mobile devices. 

We have concluded that efforts in this area focus on optimizing the desired measures and completing 
jobs in the shortest possible time. However, this seems to be unattainable using exact and optimal policies for 
scheduling and load balancing tasks on the available devices, because of the complexity of the algorithms that give 
these exact solutions. Therefore, another way is adopted in the literature, by means of heuristic and approximate 
algorithms that search for policies close to the exact optimal ones, which considerably decreases the execution 
time. We recommend that future work focus on developing this type of algorithm. 

Finally, we aim to attract researchers to enhance scheduling in the individual mobile networks segment 
of collaborative mobile computing. The main motivation behind this trend is to create a practical alternative for 
information, communication, task scheduling and many other applications, for example in cases of disaster 
or terrorism, when it becomes impossible to move from one place to another and to use the centers dedicated to these 
applications. Another reason is the facilities and benefits that these networks can provide to users in 
terms of implementing and executing their applications, and thus managing their various jobs, through mobile 
networks at any time and wherever they are. This holds for today; for the years to come, we expect that they will 
be the de facto devices due to anticipated technological improvements. 
Abstract — Education is a basic need for every human being, and 
digital education is the current trend and a necessity for 
students and learners to be more focused in their learning. In this 
paper, the authors examine this phenomenon. Digital 
education helps students and learners gather knowledge in 
easier and more varied ways than before. It also reduces learning 
time. In the traditional education system, we were mostly dependent 
on textbooks or on the instructor’s speech. Nowadays it is easier to 
find any textbook or other learning material by using 
digital educational tools. Another remarkable change that has 
transformed human life is social networking. Social networks 
account for a good portion of digital education. 
Among social networking services, Facebook has become the most 
popular for communication with familiar and unfamiliar 
persons, and its use has a strong impact on students. 
In this paper, the authors conducted a survey of various 
students to understand the effect of digitalization on 
education. Machine learning was applied to classify 
students as happy or unhappy with digitalization, focusing on 
time spent on educational purposes. Finally, the authors provide an 
analytical summary of the effect of digitalization on education based on 
their survey. 

Keywords — Digitalization , E-learning , Machine learning , Social 
networking. 

I. Introduction 

The history of the Internet began with the development of 
electronic computers in the 1950s. Like many other developed and 
developing countries, Bangladesh has 
witnessed phenomenal growth in Internet use. Although the country faces many 
constraints in expanding Internet access and use, development 
of the Internet and information technology is one of the 
government’s high priorities. In the present era, social networking 
websites such as MySpace and Facebook have been attracting 
a large number of participants. For example, in 2013 the number of Internet 
users in Bangladesh increased to 33 million [1]. 

In social networks, each node represents a participant and 
each link between participants corresponds to real-world 
interactions or online interactions between them. One 
participant can give a trust value to another based on their 
interactions [2]. Online social networking sites have been 
attracting a large number of participants and are being used as 
the means for a variety of rich activities. For example, 
participants carry out business, and share photos and movies, 
on the first generation (e.g., ebay.com) and second generation 
(e.g., facebook.com) social networking sites respectively [3]. 
The authors tried to find out the percentage of participants who 
actually use these social networking sites for digital education 
purposes. They also tried to determine whether these sites really 
help student participants gain knowledge more quickly and clearly 
and grow as learners, or whether students are just wasting their 
time, leading them to a poor career. 

This paper is organized as follows: Section II gives a brief 
historical overview of digitalization activity. Section III gives a 
short overview of this work. Section IV discusses the 
result analysis. Section V presents the conclusion and future 
plans. 

II. Related Work 

This section notes some previous discussions of 
digitalization, social networking and machine learning in 
digital education. 

Umamaheswari K. and S. Niraimathi worked with students’ 
socio-demographic variables such as age, gender, name, class 
grade, proficiency and extra skills. Their data analysis results 
help the recruitment process on the interview board through 
the students’ grades [7]. Sunday Tunmibi, Ayooluwa Aregbesola, 
Pascal Adejobi, and Olaniyi Ibrahim discussed the impact 
of e-learning and digitalization at primary and secondary 
school levels. They showed that the majority of teachers agreed 
that e-learning helps students gather more knowledge 
and resources [4]. Manoj Kumar discussed smartphone 
use in education technology. He also worked on the 
application of technical and professional studies in Indian 
education [5]. Pooja Thakar, Anil Mehta, and Manisha 
discussed educational data mining based on 
different survey results. Machine learning helps to find 
useful information for solving a problem [6]. Radhika R. 
Halde introduced a machine learning approach for 
predicting students’ performance and also compared 
different machine learning algorithms [12]. 

One impact of digitalization in Bangladesh is that books for classes 1 to 
10, along with teachers’ training materials and other necessary books, are 
available on the website www.ebook.gov.bd. The 
government provides laptops and multimedia projectors to 
20,500 public and private educational institutions to improve 
the classroom teaching-learning process [18]. 

There are many online social networking sites, such as 
Facebook, Twitter, MySpace and eBay. Among all of these 
sites, Facebook has the most participants. Facebook is an 
online social networking service. Its name stems from the 
colloquial name for the book given to students at the start of 
the academic year by some American university 
administrations to help students get to know each other [8]. 
Facebook (as of 2012) has about 180 petabytes of data a year 
and grows by over half a petabyte every 24 hours [9]. 

Following these studies, the authors worked on the impact of digitalization 
on education and discussed the positive and 
negative answers of students. Finally, the authors provide a result 
that focuses on the purpose of this survey. 

III. Overview 

Social networking is an Internet-based medium that provides 
a way to communicate with friends, family, classmates, 
customers and clients. Social networking can occur for social 
purposes, business purposes or both, through sites such as 
Facebook, Twitter, and LinkedIn [13]. Nowadays, 
social networking is also used for educational purposes. 
Facebook is the most popular social networking site. It 
was founded in February 2004 by Mark Zuckerberg with his 
college roommates. As of September 2012, Facebook had over 
one billion active users [14]. There are many Facebook pages 
and groups that help us gather information for different 
purposes. Thus it helps us increase our knowledge in addition to 
social communication. 

Machine learning algorithms can be categorized under two 
main streams: supervised learners and unsupervised learners 
[16]. In supervised learning, the program is trained with a 
pre-defined set of training examples, which then facilitates its 
ability to reach an accurate conclusion when given new data. 
In unsupervised learning, the program is given a collection of 
data and must find patterns and relationships therein [17]. 
In this paper, the authors focus on supervised learning 
approaches. 

Digital technology for education is defined as any 
process where the teacher or learner uses digital equipment 
such as a personal computer, a laptop, a tablet, an MP3 player, 
or a console to access digital tools such as learning platforms 
and virtual learning environments (VLEs) to improve their 
knowledge and skills. Learning with digital technology 
comprises ICT products such as teleconferencing, email, 
audio, television lessons, radio broadcasts, and interactive 
voice response systems [19]. Day by day the whole world is 
becoming more digital, and our education system is one of 
the biggest fields where we can introduce more digitalization. 

IV. Data Analysis and Result 

The authors conducted an online survey on Internet access 
among students of different educational institutions 
throughout Bangladesh [10]. The necessary real-time data were 
collected through a set of questions answered by students of 
the corresponding regions. Responses from 283 students were 
considered. All of the participants use the Internet at 
different times. In this paper, the authors also use machine 
learning to classify the students based on two fields: 
Gender and Happy/Unhappy with digitalization (H/UH). The 
paper also examines male and female students' interest in 
digitalization. This section discusses five issues related to 
the survey, all of which focus on the effect of digital 
education. 


A. Effectiveness 

Although traditional education in our country, i.e., 
writing on the board with chalk or a whiteboard marker, is still 
preferred by many teachers, our students think 
otherwise. One of the most important questions in our survey 
asked students about the effectiveness of 
digital education. As many as 95.4% of students 
are happy when teachers use digitization tools inside the 
classroom. Only 4.6% of students fall in the unhappy 
group. As our survey results suggest, most of our students 
are changing and embracing digitization, so our teachers should 
keep that in mind and prepare their content accordingly. 

Figure 1. Effectiveness of digitalization (legend: Yes, No). 



Another aspect of the effect of digitization tools inside the 
classroom emerged when the authors asked in the 
survey, “Do you think these digitized tools are affecting 
your classroom studying?” The answers varied along with the 
happy/unhappy ratio. The authors found that 83.4% agree that 
digitalization has a positive effect inside the classroom, while 
16.6% of students said that the traditional system is neither 
better nor worse than digitalized education. The authors found 
that students think differently about digitalization in the 
classroom. Some students have no idea about the use of 
digital tools for educational purposes. 



Figure 2. Digitalization effect on classroom study. 


B. Social Network 

Following the Vision 2021 of a digital world announced by the 
current Bangladesh Government in 2008, many of our pupils are 
now online, and a large portion of this online activity is based 
on social networks, mostly Facebook, Twitter, WhatsApp, and 
others. Nowadays social networks contribute a good portion of 
Internet traffic and thus attract tremendous research interest. 
Our students participate heavily in this activity. In this 
survey we found that 95.3% of students are fond of Facebook 
and 20.4% of Twitter in their social networking activities. 
Only 1.1% are fond of YouTube, while there are 
some other social network users in our survey. We also found 
that students spend more than 2-3 hours on social 
networking daily. 


Figure 3. Social Networking. 


C. Gender 

Since 1971 the population of Bangladesh has been rising at an 
alarming rate. According to the population structure of 
Bangladesh, the ratio of males per 100 females in the 20-24 age 
group is 76.8% [11]. However, our survey shows that the 
percentage of female students is less than 20%. This is a very 
concerning fact for women's education and development, because 
in our country most children grow up in a 
mother-centric environment. It is still believed that “An 
educated mother can provide an educated nation”. In this 
survey we found that the percentage of female students 
is 18.4%, which is not good for our nation. 



Figure 4. Percentage of Gender (legend: Female, Male). 

D. Education Time 

Our survey was based on digital education and its effect, 
and one of the key questions was “Per day how much time 
do you spend for education by using Internet?” The result is 
impressive: the average time spent on education is more than 
3 hours. Most of our students like to spend this time on 
Google, YouTube, Wikipedia, and others. Figure 5 shows the 
survey result for this scenario. 



Figure 5. Hours spend for educational purposes per day. 

E. Helpful tools inside class room 

In this digital era, students like to learn in their classroom 
using PowerPoint presentation slides, YouTube, Google 
Classroom, and a few other tools. Based on our survey results, 
we found that 85.4% of students like to use these digital tools 
inside the classroom. 


Figure 6. Useful website for education. 



F. Machine Learning and Survey 

In this section, the authors used Weka 3.8.0 to understand 
the ratio of male and female students who are happy with the 
effect of digitalization, and also the ratio of male to female 
students at the graduation level of education. Figure 7 
presents the decision tree. 



Figure 7. No. of Happy participant with digitalization 
(decision tree; branches: Female, Male; leaf shown: Yes (51.000)). 


Table 1 presents the overall results at a glance. 
It covers the five topics discussed above: Effectiveness, 
Social Network, Gender, Education Time, and Helpful websites 
and tools inside the classroom. 

TABLE I. Survey Result. 

Field                     Survey Result 
------------------------  ----------------------------------------------- 
Effectiveness             Digitalization: Y-95.4%, N-4.6% 
                          Digitalization on classroom: Y-83.4%, N-16.6% 
Social Network            Facebook 
Gender                    Male-81.3%, Female-18.4% 
Education Time            Average 3 Hrs/Day 
Helpful website & tools   Google.com, Youtube.com, 
                          Power point presentation 


V. Conclusion 

Today most learning styles have been converted into a 
digital education system. Digital education also extends 
through social networks. Here the survey results showed the 
effect of social networking on our education system. In future 
we can work on e-learning through Facebook, which will be more 
interactive and easier to access. If this study attracts the 
attention of the concerned authorities, it will be helpful for 
all learners who use Facebook across the world. In future, the 
authors aim to work with machine learning approaches that 
consider the digital education system. 

Abstract: In wireless sensor networks, messages are 
exchanged between different source and destination pairs 
cooperatively, such that multi-hop packet transmission is 
used. These data packets are transferred from the 
intermediate node to the sink node by forwarding a packet to 
destination nodes, where each node overhears the transmissions 
of its neighbor node. To avoid this, we propose a novel approach 
with efficient routing protocols, i.e., shortest path routing and 
a distributed node routing algorithm. The proposed work also 
concentrates on Automatic Repeat Request and Deterministic 
Network Coding. We extend this work with an end-to-end message 
encoding mechanism. To enhance node security, pairwise key 
generation is used, in which each pair of communicating nodes is 
assigned a pair key to make communication secure end to end. We 
analyze both single and multiple nodes and compare 
simple ARQ and deterministic network coding as methods of 
transmission. 

Keywords: SINK, Mesh Network , Sensor Deployment. 

I. Introduction 

In multi-hop wireless networks, packet transmission should 
preserve confidentiality with respect to intermediate nodes, so 
that data sent to a node is not shared by any other node. 
Additionally, even where confidentiality is not strictly 
necessary, it may not be safe to assume that nodes will always 
remain uncompromised. In wireless networks, data 
confidentiality can be seen as a safeguard that prevents a 
compromised node from accessing data from other uncompromised 
nodes. In a multi-hop network, as data packets are transferred, 
intermediate nodes receive all or part of the data 
packet through direct transmission between network nodes in 
multi-hop fashion while confidential messages are being 
exchanged. The proposed work refers to efficient 
algorithms for confidential multiuser communication over 
multi-hop wireless networks. The metric we use to measure 
confidentiality is the mutual information leakage rate to the 
relay nodes, i.e., the equivocation rate. We require this rate 
to be arbitrarily small with high probability and impose this in 
the resource allocation problem via an additional constraint. 
We consider practical delay requirements for each user, 
which eliminates the possibility of encoding over 
an arbitrarily long block. 

II. PROBLEM STATEMENT 

The proposed system addresses the problem of network 
utility maximization, into which confidentiality is 
incorporated as an additional quality of service constraint. 
The goals are secure message transmission between the source 
and a destination node with low overhead cost, and multi-hop 
data transfer with minimum overhead and secure communication 
among network nodes. The proposed system also addresses the 
problem of distributed scheduling and the cross-layer node 
allocation problem with confidentiality in a cellular wireless 
network, where users transmit information to the base station 
confidentially from the other users. 

III. LITERATURE SURVEY 

The work in [1] proposed using private and public channels 
so as to minimize the use of the (more expensive) private 
channel subject to the required level of security. It considers 
both single and multiple users and compares simple ARQ and 
deterministic network coding as methods of transmission. 
The work in [2] designs secure communications for one 
source-destination pair with the help of multiple cooperating 
intermediate nodes in the presence of one or more 
eavesdroppers. Three cooperative schemes are considered: 
decode-and-forward (DF), amplify-and-forward (AF), and 
cooperative jamming (CJ). For these schemes, the 
relays transmit a weighted version of a re-encoded noise-free 
message signal (for DF), a received noisy source signal (for 
AF), or a common jamming signal (for CJ). The work in [3] 
considers secure network coding with nonuniform or restricted 
wiretap sets, for example, networks with unequal link 
capacities where a wiretapper can wiretap any subset 
of links, or networks where only a subset of links can 
be wiretapped. The scheme in [4] does not require eavesdropper 
CSI (only statistical knowledge is assumed), and the secure 




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 





(IJCSIS) International Journal of Computer Science and Information Security, 


throughput per node increases as more legitimate users are added 
to the network in this setting. Finally, the effect of 
eavesdropper collusion on the performance of the proposed 
schemes is characterized [4]. The work in [5] characterizes the 
secrecy capacity in terms of generalized eigenvalues when the 
sender and eavesdropper have multiple antennas, the intended 
receiver has a single antenna, and the channel matrices are 
fixed and known to all the terminals, and shows that a 
beamforming strategy is capacity-achieving. In addition, it 
studies a masked beamforming scheme that radiates 
power isotropically in all directions and shows that it attains 
near-optimal performance in the high SNR regime [5]. 

IV. SYSTEM ARCHITECTURE 

In the existing system, hop-to-hop communication in a 
wireless sensor network is considered vulnerable during 
data transmission. Because hop-by-hop communication 
increases the cost of packet transmission, the existing system 
uses node-to-node authentication among network resources as 
its security mechanism. The hop-to-hop identity of intermediate 
nodes exposes the network to security threats. To avoid these 
threats, the existing system uses digital signature 
authentication at the node level for communication or packet 
transmission. In the existing system, message transmission 
passes through all neighbors between the source and destination 
nodes, which results in overhearing and increases the overhead 
between nodes. It can also lead to communication with 
compromised nodes in the wireless sensor network. 


Figure 1. Proposed System (Architecture) and Working. 

The proposed system implements an optimal dynamic 
policy for the case in which the number of blocks across 
which secrecy encoding is performed is asymptotically large. 
Next, this work considers encoding across a finite 
number of data packets, which removes the possibility 
of achieving perfect secrecy. In this case, the proposed work 
designs a dynamic policy to select the encoding rates for every 
data packet, based on the instantaneous channel state 
information, queue states, and secrecy outage 
requirements. By numerical analysis, we observe that the 
proposed scheme approaches the optimal rates asymptotically 
with increasing block size. 


Vol. 16, No. 1, January 2018 
Finally, we address the consequences of practical 
implementation issues such as infrequent queue updates and 
decentralized scheduling. The efficiency of the policies is 
demonstrated by numerical studies under various 
network conditions. In addition, the proposed system 
contributes deterministic network coding and an automatic 
repeat request mechanism to actively transfer data 
packets. Network costs and other system parameters are 
treated as constants in our work; 
the network costs are related to physical layer parameters such 
as channel encoding parameters and transmission power. The 
proposed system formulates the problem 
by adding noise to the original message or request at the 
destination. 

The proposed system also formulates the ARQ case, in which 
an automatic repeat request is sent between a number of time 
slots during packet sending. Packets are generally transferred 
from source to destination via the private channel and the 
public channel. These packets are geometrically 
distributed among network nodes. 

V. ALGORITHM DETAILS 

A. Generate an RSA key pair 
Input : Required modulus bit length, k. 

Output : An RSA key pair ((N,e), d) where N is the modulus, 
the product of two primes (N=pq) not exceeding k bits in 
length; e is the public exponent, a number less than and 
coprime to (p-1)(q-1); and d is the private exponent such that 
ed ≡ 1 (mod (p-1)(q-1)). 

    Select a value of e from { 3, 5, 17, 257, 65537 } 
    repeat 
        p <- genprime(k/2) 
    until (p mod e) ≠ 1 
    repeat 
        q <- genprime(k - k/2) 
    until (q mod e) ≠ 1 
    N <- pq 
    L <- (p-1)(q-1) 
    d <- modinv(e, L) 
    Return (N, e, d) 
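
The key generation steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the helper names `is_probable_prime`, `genprime`, and `generate_rsa_key` are hypothetical, the Miller-Rabin test stands in for an unspecified prime generator, and the extra `q != p` check is an added assumption. It is not hardened for production use.

```python
import secrets


def is_probable_prime(n, rounds=40):
    """Miller-Rabin probabilistic primality test (assumed helper)."""
    if n < 2:
        return False
    for small in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % small == 0:
            return n == small
    # Write n - 1 as d * 2**r with d odd.
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = secrets.randbelow(n - 3) + 2  # random base in [2, n - 2]
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False  # a is a witness that n is composite
    return True


def genprime(bits):
    """Return a random probable prime with exactly `bits` bits."""
    while True:
        # Force the top bit (exact length) and the low bit (odd).
        candidate = secrets.randbits(bits) | (1 << (bits - 1)) | 1
        if is_probable_prime(candidate):
            return candidate


def generate_rsa_key(k, e=65537):
    """Follow the pseudocode: pick p and q with (p mod e) != 1 and (q mod e) != 1."""
    while True:
        p = genprime(k // 2)
        if p % e != 1:
            break
    while True:
        q = genprime(k - k // 2)
        if q % e != 1 and q != p:  # q != p is an added safety assumption
            break
    n = p * q
    phi = (p - 1) * (q - 1)       # called L in the pseudocode
    d = pow(e, -1, phi)           # modular inverse, Python 3.8+
    return n, e, d


if __name__ == "__main__":
    n, e, d = generate_rsa_key(128)  # toy size, far too small for real use
    message = 42
    assert pow(pow(message, e, n), d, n) == message
```

Because e is prime, the condition (p mod e) ≠ 1 guarantees that e does not divide p-1, which is what makes the modular inverse in the last step exist.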

The system is classified into the following sets: 
Sys = {inp, process, out, analysis} 

Inp = {D1, D2, ..., Dn} 

which is the set of input data chunks. 

$$\mathrm{EncData} = \sum_{n=1}^{m} \mathrm{Enc}(D_n) \qquad (1)$$ 

Equation (1) shows the data aggregation as well as the data 
encryption process. 

$$\mathrm{Data} = \sum D[i] \qquad (2)$$ 

Equation (2) shows how the data are obtained from each node. 

$$\mathrm{PlainData} = \sum_{n=1}^{m} \mathrm{Dec}(D_n) \qquad (3)$$ 

Equation (3) shows the aggregation of the cipher data at the 
receiver, together with the decryption process. 
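
As a rough illustration of Equations (1) and (3), the sketch below encrypts each data chunk separately and aggregates the results, then reverses the process at the receiver. It assumes textbook RSA with the well-known toy parameters p = 61, q = 53 (N = 3233, e = 17, d = 2753) and treats each byte as one chunk; these values and the function names are illustrative assumptions, not taken from the paper.

```python
from typing import List

# Toy textbook-RSA parameters (p = 61, q = 53); far too small for real use.
N, E, D = 3233, 17, 2753


def enc(chunk: int) -> int:
    """Enc(D_i): encrypt one chunk, which must be smaller than N."""
    return pow(chunk, E, N)


def dec(cipher: int) -> int:
    """Dec(D_i): decrypt one chunk."""
    return pow(cipher, D, N)


def encrypt_chunks(data: bytes) -> List[int]:
    """Equation (1): aggregate the encrypted chunks of the input."""
    return [enc(byte) for byte in data]


def decrypt_chunks(ciphers: List[int]) -> bytes:
    """Equation (3): decrypt every received chunk and reassemble the data."""
    return bytes(dec(c) for c in ciphers)


if __name__ == "__main__":
    payload = b"sensor data"
    assert decrypt_chunks(encrypt_chunks(payload)) == payload
```

Encrypting byte by byte keeps the example short but leaks repeated bytes; a practical system would use padding or a hybrid scheme.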

B. Construction of updated BTC 

Input: Initial source node sn, destination node dn, group of 
neighbor nodes nd[], each node's id, each node's energy eng. 
Output: Source-to-destination path when the data are received 
successfully. 

Step 1: User first selects sn and dn. 

Step 2: Choose the packet or file f for data transmission. 

Step 3: if (f != null) fd <= f 

Step 4: Read each byte b from fd until null is reached. 

Step 5: Send data; initialize cf1, cf2, pf1, pf2. 

Step 6: while (nd[i] != NULL) 
            cf1 = nd[i].eng 
            pf1 = nd[i].id 
            cf2 = nd[i+1].eng 
            pf2 = nd[i+1].id 
Step 7:     if (cf1 > cf2) 
                cf2 = null 
                pf2 = null 
            else 
                pf1 = pf2 
                cf1 = cf2 
                pf2 = null 
                cf2 = null 
Step 8: end while 

Step 9: Repeat until the sink node is reached. 
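
One possible reading of Steps 6 to 9 is a greedy search that, at each hop, keeps the neighbor with the highest remaining energy and repeats until the sink is reached. The Python sketch below follows that reading. The data structures (`adjacency`, `energy`), the function names, the rule of skipping already visited nodes, and the direct delivery when the sink is a neighbor are all assumptions made for illustration.

```python
def pick_next_hop(neighbors):
    """Steps 6-8: scan (node_id, energy) pairs and keep the highest energy.

    On a tie the later neighbor wins, mirroring the 'else' branch of Step 7.
    """
    best_id, best_energy = None, None
    for node_id, node_energy in neighbors:
        if best_energy is None or node_energy >= best_energy:
            best_id, best_energy = node_id, node_energy
    return best_id


def build_path(source, sink, adjacency, energy):
    """Step 9: repeat next-hop selection until the sink node is reached."""
    path = [source]
    visited = {source}
    current = source
    while current != sink:
        if sink in adjacency[current]:
            path.append(sink)  # assumed: deliver directly when possible
            break
        candidates = [(n, energy[n]) for n in adjacency[current]
                      if n not in visited]
        if not candidates:
            return None  # dead end: no unvisited neighbor left
        current = pick_next_hop(candidates)
        visited.add(current)
        path.append(current)
    return path


if __name__ == "__main__":
    adjacency = {"S": ["A", "B"], "A": ["S", "D"], "B": ["S", "C"],
                 "C": ["B", "D"], "D": []}
    energy = {"A": 5, "B": 9, "C": 4}
    print(build_path("S", "D", adjacency, energy))
```

A greedy choice like this does not guarantee the shortest path; it trades hop count for spreading load toward nodes with more remaining energy.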

VI. Experimental setup 

We ran our experiments in the NS2 simulator, version 
2.35, which has been shown to produce realistic results. NS2 
runs TCL code, but here both TCL and C++ code are used for 
header input. In our simulations, we use an infrastructure-based 
network environment for communication, which provides 
access to the wireless network at any time for 
network selection. 

The WMN is simulated in NS2; the .tcl file describes the 
simulation of the overall proposed architecture. To run the 
.tcl file we use the EvalVid framework in the NS2 simulator, 
which also helps to store running connection information 
using the connection pattern file usl. The NS2 trace file (.tr) 
helps to analyze the results. It supports filtering, processing, 
and displaying vector and scalar data. The results directory in 
the project folder contains the us.tr file, which stores the 
performance results of the simulation. Based on the us.tr file, 
we use the xgraph tool to plot the result parameters with 
respect to the x- and y-axis parameters. The graph files have 
the .awk extension and are used with the xgraph tool to plot the 
graphs. 

A. Types of simulation 


Parameter                         Value 
--------------------------------  -------------------------- 
Simulator                         ns-allinone-2.35 
Simulation Time                   40 sec 
Channel Type                      Wireless Channel 
Propagation Model                 Propagation/TwoRayGround 
Medium                            Phy/WirelessPhy 
Standard                          Mac/802_11 
Logical Link Layer                LL 
Antenna                           Antenna/OmniAntenna 
X dimension of the topography     1500 
Y dimension of the topography     1000 
Max Packet in ifq                 1000 
Adhoc Routing                     AODV 
Routing                           DSR 
Traffic                           cbr 


Table 1. Behaviour of parameters versus Simulation time for 
Different Nodes . 

These Parameters are defined and evaluated below: 

B. Average End-to-End Delay 

End-to-End Delay (E2ED) refers to the time taken by 
a data packet to travel from a source to a destination in a 
network. Only data that successfully reach the 
destination are considered. A smaller value of end-to-end 
delay indicates superior performance of the protocol. 

C. Packet Delivery Ratio 

The packet delivery ratio (PDR) is defined as the ratio of 
the number of data packets that reach the target over the 
network to the number of packets generated. A greater value of 
packet delivery ratio indicates superior performance of the 
protocol. 

D. Throughput 

Throughput can be defined as the ratio of the total 
bytes in data packets received by sink nodes to the time from 
the first packet generated at a source to the last packet 
received by sink nodes. A greater value of throughput indicates 
superior performance of the protocol. 

E. Energy Consumption 

Energy consumption is one of the most important concepts in 
WSNs. The lifetime of the sensor network depends on the energy 
consumption of the sensor nodes. The total energy consumption 
of a node is defined as the difference between the initial 
energy and the final energy of the node. A smaller value of 
energy consumption indicates superior performance of the 
protocol. 
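
The four metrics defined above can be computed from per-packet records roughly as follows. This is a minimal sketch: the `Packet` record and its field names are hypothetical and are not the NS2 trace format, so a real analysis would first parse the .tr file into such records.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    sent_at: float                 # generation time at the source (s)
    received_at: Optional[float]   # arrival time at the sink, None if lost
    size_bytes: int


def average_e2e_delay(packets: List[Packet]) -> float:
    """Mean delay over packets that actually reached the destination."""
    delays = [p.received_at - p.sent_at
              for p in packets if p.received_at is not None]
    return sum(delays) / len(delays) if delays else 0.0


def packet_delivery_ratio(packets: List[Packet]) -> float:
    """Delivered packets divided by generated packets."""
    if not packets:
        return 0.0
    delivered = sum(1 for p in packets if p.received_at is not None)
    return delivered / len(packets)


def throughput(packets: List[Packet]) -> float:
    """Received bytes divided by the time from first send to last receive."""
    delivered = [p for p in packets if p.received_at is not None]
    if not delivered:
        return 0.0
    start = min(p.sent_at for p in packets)
    end = max(p.received_at for p in delivered)
    duration = end - start
    return sum(p.size_bytes for p in delivered) / duration if duration > 0 else 0.0


def energy_consumed(initial_energy: float, final_energy: float) -> float:
    """Total node energy consumption: initial energy minus final energy."""
    return initial_energy - final_energy
```

Multiplying the delivery ratio by 100 gives the percentage form used in the result tables.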

F. Results for 100 Nodes 

1) Delay versus Simulation Time 

The end-to-end delay of SINGLE HOP, DUAL HOP, 
and DDT increases with Simulation Time. However, the 
increasing trend in DUAL HOP and SINGLE HOP is much 
higher than in the proposed approach, as shown in Table 2. A 
smaller value of end-to-end delay indicates superior 
performance of the protocol. Figure 2 shows that the proposed 
system gives superior performance compared with the other 
three protocols. 


Delay 

Simulation   Multi Hop    Single Hop   Dual Hop   Distributed Data 
Time         (Proposed)                           Transmission (DDT) 
----------   ----------   ----------   --------   ------------------ 
0.15         0.00562      0.00752      0.00622    0.00803 
0.20         0.00578      0.00782      0.00653    0.00901 


Table 2. Delay of 100 Nodes. 



2) Packet Delivery Ratio versus Simulation Time 

The packet delivery ratio of SINGLE HOP, DUAL 
HOP, and DDT, compared with the proposed system, decreases 
with increasing Simulation Time, as shown in Table 3. However, 
the decreasing trend in SINGLE HOP and DUAL HOP is much 
smaller than in the proposed approach. A greater value of 
packet delivery ratio indicates superior performance of the 
protocol, as shown in Fig 3. 

PDR 

Simulation   Multi Hop    Single Hop   Dual Hop   Distributed Data 
Time         (Proposed)                           Transmission (DDT) 
----------   ----------   ----------   --------   ------------------ 
0.15         95.20        90.20        92.45      95.10 
0.20         95.15        90.40        91.30      96.03 


Table 3. PDR of 100 Nodes. 



Fig. 3. PDR versus Simulation Time. 


3) Throughput versus Simulation time 

Figure 4 shows the throughput under different 
network scales for DUAL HOP, SINGLE HOP, DDT, and 
Multi-Hop. The throughput of the proposed approach, SINGLE 
HOP, DDT, and DUAL HOP increases with increasing Simulation 
Time. A greater value of throughput indicates superior 
performance of the protocol, as shown in Table 4. 


Throughput 

Simulation   Multi Hop    Single Hop   Dual Hop   Distributed Data 
Time         (Proposed)                           Transmission (DDT) 
----------   ----------   ----------   --------   ------------------ 
0.15         196.20       189.20       183.45     179.10 
0.20         194.15       188.40       184.30     181.03 


Table 4. Throughput of 100 Nodes. 



4) Energy versus Simulation time 

The energy consumption of DUAL HOP, SINGLE 
HOP, DDT, and Hybrid DUAL HOP decreases with increasing 
Simulation Time. However, the decreasing trend in DUAL 
HOP and the proposed approach is much higher than in SINGLE 
HOP and DDT, as shown in Table 5. A smaller value 
of energy consumption indicates superior performance of the 
protocol, as shown in Fig 5. 


Energy 

Simulation   Multi Hop    Single Hop   Dual Hop   Distributed Data 
Time         (Proposed)                           Transmission (DDT) 
----------   ----------   ----------   --------   ------------------ 
0.15         755          1120         1320       1760 
0.20         956          1293.40      1570       1985 


Table 5. Energy required for simulation of 100 Nodes. 

Fig. 6. System comparison proposed vs existing (milliseconds). 



Fig. 5. Energy versus Simulation Time (series: Single Hop, Dual 
Hop, DDT, Multi Hop). 

5) Accuracy of System 

To evaluate the performance of the system, simulations were 
performed. The network architecture considered is the 
following: 

• A fixed base station (sink node) is located away from 
the sensor field. 

• The sensor nodes are energy constrained with 
homogeneous initial energy allocation. 

• Each sensor node senses the surroundings at a fixed 
rate and always has data to send to the base 
station (data are sent if an event occurs). 

• The sensor nodes are assumed to be stationary. 
However, the protocol can also support. 

We compare the results of the proposed system with those of 
different existing systems. The table below shows the 
comparative analysis of the proposed system against some 
existing systems. 


We consider energy evaluation for transmission, which 
conserves node energy at the time of transmission: the 
system selects an efficient path for communication with the 
neighbor node, while the remaining nodes in the network 
sleep. 


VII. CONCLUSION 

This work provides secure and efficient path reconstruction 
that is robust to packet losses as well as routing dynamics. At 
the node side, Pathfinder is a mechanism that exploits the 
correlation between a set of packet paths and efficiently 
compresses the path information using path difference. At the 
sink side, Pathfinder infers packet paths from the compressed 
information and uses intelligent path speculation to reconstruct 
the packet paths with a high reconstruction ratio. 

Simple Automatic Repeat Request (ARQ) and 
Deterministic Network Coding (DNC) are considered, where in 
each time slot the source forms M linearly independent 
deterministic combinations of the M packets and then uses 
simple ARQ to transmit each linear combination reliably to the 
destination. We assume in this case that the receiver does not 
make an inference from the received linear combinations but 
rather either decodes the transmitted packets or not. 

Abstract: Online Social Networks have become a prominent mode of communication and collaboration. Link 
Prediction is a major issue in Social Networks. Though ample methods have been proposed to solve it, most of them take 
a static view of the network. Social Networks are dynamic in nature, and this aspect has to be accounted for. In this paper 
we propose a novel predictor, LCF, for Link Prediction in dynamic networks. In this method we view a Social Network as a 
sequence of snapshots, where each snapshot is the state of the network in a particular time period. Each edge of the 
network is assigned a weight based on its time stamp. We compute the LCF score for all node pairs in the network to 
predict the associations that may occur at a future time in the Social Network. We also show that our predictor 
outperforms the standard baseline methods for Link Prediction. 


I. Introduction 

Digital Social Media has brought about a revolution for mankind. It has drastically changed the way people 
connect, communicate, and collaborate. Today people meet, chat, discuss, debate, and even do business through Social 
Media. Online Social Networks (OSNs) such as Facebook, Twitter, Instagram, and Flickr facilitate these interactions. The 
exponential growth of these OSNs has opened up new arenas of research. Social Network Analysis (SNA) is a field of 
research that deals with the tools and strategies for the study of social networks. Link Prediction (LP) is one of 
the various problems that have been addressed by SNA. Link Prediction is the task of predicting future interactions that 
may occur in the social network. Fig. 1 illustrates the problem: in the friendship network shown in the figure, we need 
to predict whether any two people who are not friends at present could connect with each other and become friends at a 
future time. For example, in the friendship network of Fig 1 we might be interested in finding out whether Alice and Bob, 
or Sam and Jack, could become friends in future. 




Fig 1. Illustration of Link Prediction Problem 

Link Prediction can be applied in the study of network evolution [6]. It also finds application in recommendation 
systems on social networks [3], to recommend friends on Facebook or Google+, to find employers or employees in 
professional networks such as LinkedIn, and to improve sales in e-commerce by suggesting products that customers may 
be interested in buying or even suggesting online shopping websites. Link prediction can also be employed in finding 
experts and collaborations in academic social networks [7] such as co-authorship networks. It can be used in biological 
networks such as protein-to-protein interaction and metabolic networks. It can be employed to unravel unknown 
connections in terrorist networks [9]. The wide range of applicability of Link Prediction has generated a lot of 
interest among researchers and has attracted people from the fields of Computer Science, Physics, Economics, and 
Sociology [10]. This has served as a motivating factor for us to work on the problem. Though a lot of work has been 
done in this area and numerous methods have been proposed in the literature, most of these methods take a static view of 
the Social Network. Social Networks are intrinsically dynamic, and this nature has to be accounted for. We have worked in 
this direction and come up with a novel method called the Latest Common Friend (LCF) predictor, which embodies 
the temporal aspect of Social Networks. The contributions of our work are the following: 

• A new predictor for Link Prediction in dynamic Social Networks, LCF, has been defined. We compute the 
score by using the time stamps of the edges, hence accommodating time, which is a key attribute of Social 
Networks. 

• We define a Social Network to be an aggregate of a sequence of snapshots. The structure of each snapshot is 
decided based on the window size. 

• The LCF predictor is compared with the standard baseline methods, Common Neighbor (CN), Adamic 
Adar (AA), and Jaccard (JC), and is shown to outperform them. 
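
For reference, the three baseline scores named in the last contribution can be sketched with their standard textbook definitions: CN counts shared neighbors, JC divides that count by the size of the union of the two neighborhoods, and AA sums 1/log(degree) over the shared neighbors. The code below is a minimal illustration on a static undirected graph; the helper names and the small example graph (which reuses the names Alice, Bob, Sam, and Jack from Fig. 1, with made-up edges) are assumptions and do not reproduce the paper's data or the LCF predictor.

```python
import math
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)}."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, x, y):
    """CN(x, y) = number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def jaccard(adj, x, y):
    """JC(x, y) = shared neighbors divided by the union of neighbors."""
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0


def adamic_adar(adj, x, y):
    """AA(x, y) = sum of 1 / log(degree(z)) over shared neighbors z."""
    return sum(1.0 / math.log(len(adj[z]))
               for z in adj[x] & adj[y] if len(adj[z]) > 1)


if __name__ == "__main__":
    # Hypothetical friendship edges, for illustration only.
    edges = [("Alice", "Sam"), ("Bob", "Sam"), ("Alice", "Jack"),
             ("Bob", "Jack"), ("Jack", "Eve")]
    adj = build_adjacency(edges)
    print(common_neighbors(adj, "Alice", "Bob"),
          jaccard(adj, "Alice", "Bob"),
          adamic_adar(adj, "Alice", "Bob"))
```

All three scores ignore when an edge was created, which is the limitation that motivates a time-aware predictor.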


II. Related work 

The methods that exist for Link Prediction can be classified as similarity-based approaches, path-based approaches, 
and learning-based approaches. Similarity-based approaches use node information or the topology of the 
network to predict links. Liben-Nowell [5] pioneered topology-based prediction metrics that worked 
well on social networks. They worked on proximity metrics such as Common Neighbor (CN), Adamic Adar (AA), and Jaccard 
Coefficient (JC), and demonstrated the prediction capability of these metrics. Some of the path-based approaches are 
Local Path (LP), which uses the information of local paths between nodes to predict links, and the Katz measure, which is 
based on an ensemble of paths. Learning-based approaches can further be classified as feature-based or probabilistic 
models. In feature-based approaches, Link Prediction is treated as a typical binary classification problem and 
supervised classification models such as Bayes or SVMs are used to solve it. These learning-based approaches 
have difficulties in feature selection and are also computationally demanding compared with similarity-based 
approaches. Some probabilistic models such as Markov Random Fields have been proposed. Most of these methods 
have been applied and tested on static networks. Since social networks are dynamic in nature and the relationships 
among the members of the network change over time, it becomes necessary for prediction methods to incorporate the 
temporal aspect of these networks while predicting future associations. Recently, some approaches have been 
proposed to incorporate the temporal aspect of Social Networks while predicting future links. In some of these 
methods, the metrics used in static networks are modified to suit dynamic networks. One such work 
modifies the Common Neighbor metric by finding the common neighbors within two hops [11] in the network. 
Similarly, weighted versions of similarity metrics [2] have been employed on a time series graph. A random-walk-based 
approach [1] has been proposed for uncertain temporal networks, in which similarity scores are computed for the 
node and a sub-graph surrounding the node. They integrate time and topological information to produce better results. 
Some learning-based approaches based on an unsupervised feature learning method [8] have also been tested on Social 
Networks. 

III. Preliminaries 

A. Dynamic Social Networks 

With every passing minute a lot of activity happens on social media: new profiles are
added, messages are sent, and photos and videos are shared. This keeps changing the structure of social
networks, so the timing of the activities on a social network is essential information. Hence we define Dynamic Social
Networks in the following way:

Let DSN be a Dynamic Social Network with vertex set $V_D$ and edge set $E_D$ such that

$$V_D = \{n_1, n_2, n_3, n_4, \ldots, n_n\}$$

$$E_D = \{e_1, e_2, e_3, e_4, \ldots, e_n\}$$

Every edge $e_i \in E_D$ has a time stamp attribute $ts_{e_i}$ indicating the time of creation of the edge $e_i$.

A Dynamic Social Network is divided into N snapshots, each represented as a graph. The graph of a
particular snapshot contains all the vertices and edges present in the network during a specific time period T. The
number of snapshots depends on the window size (ws). If ws is 3, the network is divided into 3 snapshots,
each covering a time period T, as shown below.

$$G_0 = (V_{t_0}, E_{t_0}),\quad V_{t_0} \subseteq V_D,\ E_{t_0} \subseteq E_D,\ \text{such that } ts_{e_i} \text{ falls in the time period } t_0 \text{ for all } e_i \in E_{t_0}$$

$$G_1 = (V_{t_1}, E_{t_1}),\quad V_{t_1} \subseteq V_D,\ E_{t_1} \subseteq E_D,\ \text{such that } ts_{e_i} \text{ falls in the time period } t_1 \text{ for all } e_i \in E_{t_1}$$

$$G_2 = (V_{t_2}, E_{t_2}),\quad V_{t_2} \subseteq V_D,\ E_{t_2} \subseteq E_D,\ \text{such that } ts_{e_i} \text{ falls in the time period } t_2 \text{ for all } e_i \in E_{t_2}$$

$$G_{DSN} = G_0 \cup G_1 \cup G_2$$

Here, the lengths of the time periods $t_0$, $t_1$ and $t_2$ are all equal to T.
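
The division into snapshots can be illustrated with a minimal Python sketch. It assumes that edges are given as (u, v, timestamp) triples and that the observed time range is cut into ws periods of equal length; the function name `split_into_snapshots` and the sample edges are hypothetical and are not taken from the paper.

```python
def split_into_snapshots(edges, ws):
    """Split timestamped edges into `ws` snapshots of equal duration T.

    `edges` is an iterable of (u, v, timestamp) triples. The result is a list
    of `ws` edge lists; snapshot i holds the edges whose time stamp falls in
    the i-th time period.
    """
    edges = list(edges)
    if ws < 1:
        raise ValueError("ws must be at least 1")
    snapshots = [[] for _ in range(ws)]
    if not edges:
        return snapshots
    t_min = min(ts for _, _, ts in edges)
    t_max = max(ts for _, _, ts in edges)
    period = (t_max - t_min) / ws  # length T of each time period
    for u, v, ts in edges:
        if period == 0:
            idx = 0  # all edges share one time stamp
        else:
            # The edge with the latest time stamp is clamped into the last snapshot.
            idx = min(int((ts - t_min) / period), ws - 1)
        snapshots[idx].append((u, v, ts))
    return snapshots


edges = [("a", "b", 0), ("b", "c", 10), ("a", "c", 35), ("c", "d", 60), ("a", "d", 90)]
snaps = split_into_snapshots(edges, ws=3)
```

With ws = 3 and time stamps between 0 and 90, each period in this sketch has length 30, so the five sample edges fall into snapshots of sizes 2, 1 and 2.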


B. Problem Definition 

Given an undirected dynamic network $G_D = (V_D, E_D)$ represented as a sequence of snapshots
$G_D = \{G_0, G_1, G_2, \ldots, G_T\}$, where $G_t = (V_t, E_t)$ and t ranges from 0 to T, for every node pair (x, y)
with $x, y \in V_t$ but $(x, y) \notin E_t$, the link prediction task is to determine whether $(x, y) \in E_{t+1}$.




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 





International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


IV. The Proposed Method: Latest Common Friend (LCF) Predictor



In the proposed method we define a new predictor, the LCF, in which the time stamps of the edges in the network are used to
compute the score. Every edge in the network is assigned a weight based on its time stamp. The edge
with the oldest time stamp is assigned a weight of 0 and the edge with the latest time stamp is assigned a weight of 1. All
the other edges are assigned weights in the range [0, 1]. The LCF score of every node pair in the network is computed by
considering the common friends of the node pair, such that common friends with a higher edge weight contribute
more to the score. We compute the cumulative weight of the common friends of a node pair as follows:

Let CF be the list of common friends of node pair (x, y), N be the number of common friends, and $E_D$ be the set of edges
of the graph; then:


$$Wt\_of\_CF(x, y) = \sum_{K=0}^{N-1} \left[ Wt(x, CF[K]) + Wt(y, CF[K]) \right] \qquad (1)$$

$$LCF\_Score(x, y) = \|CF\| \cdot Wt\_of\_CF(x, y) \qquad (2)$$

$$Avg\_wt\_of\_Network = \frac{\sum_{M=0}^{\|E_D\|-1} wt(E_D(M))}{\|E_D\|} \qquad (3)$$

The LCF score of Alice and Bob is computed as follows:

Common friends of Alice and Bob are (Amy, Sam)
Weight of (Alice, Amy) = 0.3
Weight of (Bob, Amy) = 0.2
Weight of (Alice, Sam) = 0.2
Weight of (Bob, Sam) = 0.1
Cumulative weight of the common friends of (Alice, Bob) = 0.3 + 0.2 + 0.2 + 0.1 = 0.8
LCF score of (Alice, Bob) = 2 * 0.8 = 1.6

Algorithm LCF_Score

Input: Weighted graphs {G^1, G^2, G^3, ..., G^N}
Output: LCF_Score of each G^i

for every input graph G^i in {G^1, G^2, G^3, ..., G^N}
    E <- sum of the weights of the edges in G^i
    T <- total number of edges in G^i
    avg_wt <- E / T
    for each node pair (x, y) in G^i
        CF <- Γ(x) ∩ Γ(y)
        LCF <- 0
        for each common friend k in CF
            LCF <- LCF + 1
            Cum_wt <- wt(x, k) + wt(y, k)
            if Cum_wt > avg_wt
                LCF_Score(x, y) <- LCF * Cum_wt
            end if
        end for
    end for
    return LCF_Score
end for


The LCF predictor is outlined in Algorithm LCF_Score. The input to the algorithm is a set of weighted
graphs, in which the weight of each edge is computed from its time stamp. The algorithm computes and
outputs the LCF score of every node pair of each input graph. AUC values are then computed from
the LCF scores of the node pairs in the graph.
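
The score defined by equations (1) to (3) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the linear min-max scaling of time stamps is assumed (the text only fixes the end points 0 and 1), the average-weight threshold of Algorithm LCF_Score is not applied, and the names `timestamp_weights`, `lcf_score` and `average_weight` are hypothetical.

```python
def timestamp_weights(edge_times):
    """Map each edge's time stamp to a weight in [0, 1] (oldest -> 0, latest -> 1).

    Linear min-max scaling is an assumption; the text only fixes the two end points.
    """
    t_min, t_max = min(edge_times.values()), max(edge_times.values())
    span = t_max - t_min
    return {e: (t - t_min) / span if span else 1.0 for e, t in edge_times.items()}


def lcf_score(x, y, neighbors, wt):
    """Equations (1) and (2): ||CF|| times the cumulative weight of the common friends."""
    common = neighbors[x] & neighbors[y]
    cum_wt = sum(wt[frozenset((x, k))] + wt[frozenset((y, k))] for k in common)
    return len(common) * cum_wt


def average_weight(wt):
    """Equation (3): mean edge weight of the network."""
    return sum(wt.values()) / len(wt)


# Worked example from the text, with the edge weights given directly.
wt = {
    frozenset(("Alice", "Amy")): 0.3,
    frozenset(("Bob", "Amy")): 0.2,
    frozenset(("Alice", "Sam")): 0.2,
    frozenset(("Bob", "Sam")): 0.1,
}
neighbors = {
    "Alice": {"Amy", "Sam"},
    "Bob": {"Amy", "Sam"},
    "Amy": {"Alice", "Bob"},
    "Sam": {"Alice", "Bob"},
}
score = lcf_score("Alice", "Bob", neighbors, wt)  # about 1.6, up to floating-point rounding
```

Edges are keyed by `frozenset` pairs so that (x, k) and (k, x) refer to the same undirected edge.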
V. EMPIRICAL EVALUATION AND RESULTS


A. Dynamic Social Networks 

Fig 3 shows the framework of our Link Prediction method. There are three distinct phases:

1. Generate Network Snapshots: From every data set, generate N snapshots of the network, each covering a
time period T, where the value of T is decided by the parameter ws.

2. Compute Similarity Scores: For every snapshot generated in phase 1, compute similarity scores based on
LCF, CN, JC and AA.

3. Compute Performance Measure: For a snapshot $G_t$, compare the scores computed in phase 2 for the node pairs
that are existing edges and non-existing edges in the snapshot $G_{t+1}$, and compute the AUC
score.
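
For reference, the three baseline scores used in phase 2 can be sketched in Python as below. The definitions are the standard ones for Common Neighbor, Jaccard Coefficient and Adamic Adar; the adjacency dictionary `nbrs` and the function names are illustrative assumptions, not code from the paper.

```python
import math


def common_neighbors(x, y, nbrs):
    """CN: number of neighbors shared by x and y."""
    return len(nbrs[x] & nbrs[y])


def jaccard(x, y, nbrs):
    """JC: shared neighbors divided by the size of the combined neighborhood."""
    union = nbrs[x] | nbrs[y]
    return len(nbrs[x] & nbrs[y]) / len(union) if union else 0.0


def adamic_adar(x, y, nbrs):
    """AA: shared neighbors weighted by the inverse log of their degree."""
    # Degree-1 neighbors are skipped to avoid dividing by log(1) = 0.
    return sum(1.0 / math.log(len(nbrs[z])) for z in nbrs[x] & nbrs[y] if len(nbrs[z]) > 1)


# Small undirected example graph as an adjacency dictionary.
nbrs = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
cn = common_neighbors("a", "b", nbrs)
jc = jaccard("a", "b", nbrs)
aa = adamic_adar("a", "b", nbrs)
```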


B. Data Sets 

The algorithm is tested on 8 different data-sets. All the data-sets are real-world temporal networks that include the
time stamp of every edge. There may be multiple edges between two nodes, one for each communication between
them. We consider two nodes to be connected if there is at least one communication in either direction. If there are
multiple edges between two nodes, the highest time stamp, which indicates the latest time of communication, is
taken as the time stamp of that edge. Table 1 describes the data-sets used to test our
algorithm. The first four data-sets are communication data pertaining to email communication among students of
academic institutions or to corporate communications. The other four are collaboration networks [4] pertaining to
co-authorship or to Internet discussion forums on certain topics. The complete details of these data-sets can be
found at the URLs provided in Table 1.
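
The edge pre-processing described above can be sketched in Python as follows. The sketch assumes the raw data is a list of (sender, receiver, timestamp) records; the function name `collapse_multi_edges` and the choice to ignore self-communications are assumptions made for illustration.

```python
def collapse_multi_edges(records):
    """Collapse directed, repeated communications into undirected edges.

    `records` is an iterable of (sender, receiver, timestamp). The result maps
    each unordered node pair to its latest (highest) time stamp.
    """
    latest = {}
    for u, v, ts in records:
        if u == v:
            continue  # assumption: self-communications are ignored
        pair = frozenset((u, v))
        if pair not in latest or ts > latest[pair]:
            latest[pair] = ts
    return latest


records = [("a", "b", 5), ("b", "a", 9), ("a", "b", 7), ("a", "c", 1)]
edge_times = collapse_multi_edges(records)
```

In this example the three communications between a and b become a single undirected edge carrying the latest time stamp, 9.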


Table 1. Data-set Description

Dataset            # of Nodes   # of Edges   Time Span (in months)   Source
College messages   1,899        59,835       193                     http://snap.stanford.edu/data/CollegeMsg.html
Enron              87,273       1,148,072    120                     http://konect.uni-koblenz.de/networks/enron
Digg               30,398       87,627       1                       http://konect.uni-koblenz.de/networks/munmun_digg_reply
EU_ALL             265,214      420,045      803                     http://konect.uni-koblenz.de/networks/email-EuAll
Math Overflow      24,818       506,550      78                      https://snap.stanford.edu/data/sx-mathoverflow.html
Ask Ubuntu         159,316      964,437      87                      https://snap.stanford.edu/data/sx-askubuntu.html
Cit-HepPh          34,546       421,578      124                     https://snap.stanford.edu/data/cit-HepPh.html
Cit-HepTh          27,770       352,807      124                     https://snap.stanford.edu/data/cit-HepTh.html

C. Results and Discussion

The proposed LCF algorithm was tested on 8 different data-sets and the results were compared with the traditional
methods Common Neighbor (CN), Jaccard (JC) and Adamic Adar (AA). Each data-set is divided into 10 snapshots.
We compute the LCF score for the snapshots at times T0, T1, ..., T10. The LCF score is computed for all node pairs,
covering both existing and non-existing links. The AUC value is computed using these scores. "The AUC value can be interpreted as the
probability that a randomly chosen existing link is given a higher score than a randomly chosen nonexistent link".
Since we know all the edges that exist at time t1, we compare the scores of a randomly chosen existing link and a
randomly chosen non-existing link. The AUC value is computed as shown below:

Let n be the total number of comparisons made. If n' times the existing link had the higher score and n'' times the two
had an equal score, then:

$$AUC = \frac{n' + 0.5\,n''}{n}$$
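
The AUC computation above can be sketched in Python as a sampling procedure. The sketch assumes that scores are stored in a dictionary keyed by node pair and that the lists of existing and non-existing links in the next snapshot are known; the function name `auc_by_sampling`, the default of 10000 comparisons and the fixed seed are illustrative choices.

```python
import random


def auc_by_sampling(scores, existing, missing, n=10000, seed=0):
    """Estimate AUC = (n' + 0.5 * n'') / n from n random comparisons.

    `scores` maps a node pair to its similarity score; `existing` and `missing`
    are lists of node pairs that are, respectively, links and non-links in the
    next snapshot.
    """
    rng = random.Random(seed)
    higher = equal = 0
    for _ in range(n):
        s_pos = scores[rng.choice(existing)]
        s_neg = scores[rng.choice(missing)]
        if s_pos > s_neg:
            higher += 1  # counts towards n'
        elif s_pos == s_neg:
            equal += 1   # counts towards n''
    return (higher + 0.5 * equal) / n


scores = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.1, ("c", "d"): 0.2}
auc = auc_by_sampling(scores, existing=[("a", "b"), ("a", "c")], missing=[("b", "d"), ("c", "d")])
```

A predictor that always ranks existing links above non-existing ones yields 1.0, while one that cannot separate them at all yields about 0.5.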


Fig 4 shows the AUC values of the 10 snapshots of the 4 communication data-sets. Our LCF algorithm
performs better than the traditional ones except on one data-set, EU_all, on which CN performs on par with LCF.
Fig 5 shows the AUC values of the snapshots of the collaboration networks; LCF performs very well in all four
networks compared to the traditional methods. Finally, in Fig 6 we compare the average AUC values over all 8
data-sets given by LCF and the other traditional methods; our LCF algorithm has higher average AUC values.

Fig 4. Performance of LCF and Traditional methods on Communication Networks

Fig 5. Performance of LCF and Traditional methods on Collaboration Networks

[Horizontal-axis categories: College Message, Enron, Digg, EU_all, Math Overflow, Ask Ubuntu, HEP-PH, HEP-TH]

Fig 6. Average AUC Values of LCF and Traditional methods

VI. CONCLUSION

In this work the problem of Link Prediction in dynamic social networks was investigated, and we have proposed a new
algorithm for it. Time is a very important attribute in dynamic social networks; hence the proposed
method computes the scores of node pairs by utilizing the time stamps of the edges. The algorithm has been tested
on eight real-world data-sets, and we have shown that it performs better than the traditional Link Prediction algorithms.
Abstract — Information on the web has been increasing at a
tremendous rate in recent years. This voluminous data
has created intricate problems for information retrieval and
knowledge management. As data resides on the web in several
forms, knowledge management on the web is a challenging
task. The 'Semantic Web' concept may be used to let
machines understand web content and offer
intelligent services efficiently with a meaningful
knowledge representation. Data retrieval in the traditional
web is focused on 'page ranking' techniques, whereas in
the semantic web data retrieval is based on
'concept based learning'. The proposed work is aimed at the
development of a new framework for automatic generation of
ontology and RDF for real-time Web data, extracted from
multiple repositories by tracing their URIs and text documents.
An improved inverted indexing technique is applied for ontology
generation and Turtle notation is used for RDF. A
program is written for validating the extracted data from
multiple repositories by removing unwanted data and considering
only the document section of the web page.

Index Terms — Semantic Web, Resource description 
framework, Ontology, Improved inverted indexing technique, 
Knowledge management. 


I. INTRODUCTION 

The World Wide Web (WWW) is considered a global
information repository that identifies documents and other web
resources by Uniform Resource Locators, interlinked by
hypertext links. Search engines are used to retrieve
information from the web. Data overload is the most
pressing issue for the existing system these days.
The evolution of the web includes the versions web 1.0, 2.0,
etc. In this series, web version 3.0, referred to as the semantic
web [1], has evolved as a knowledge management support across
the globe. Search engines should be enriched with semantic
web capabilities that analyze webpage content and provide
more relevant results for the user query.
Semantic web standards include the Resource Description
Framework (RDF), web ontology, RDF Schema and the Rule
Interchange Format (RIF) for handling data. RDF provides a
conceptual description of information for representing web
resources, with syntaxes such as Turtle and N-Triples, and
describes data on the Web in graph form
[2]. Ontologies consist of a finite set of terms, relationships,
constraints and axioms [3]. Ontologies have proven to be
useful for effective knowledge modeling and information
retrieval. The remainder of the paper is arranged as follows:
Section 2 presents the related work. The proposed work
and its methodology are discussed in Sections 3 and 4. The
results are presented in Section 5. Conclusions are given in
Section 6.


II. Related Work 

M.S.P. Babu et al. [4] provided an overview of some of the
semantic search engines that yield a unique search experience
for users. Wilkinson et al. [5] proposed an information retrieval
system using document structure. Amel Grissa Touzi et al. [6]
suggested the Fuzzy Ontology of Data Mining (FODM) for
automated generation of ontologies in the domain
of data mining. Amira Aloui et al. [7] implemented a plug-in
named "FO-FQ Tab plug-in", which can be integrated with the
Protégé editor for building fuzzy ontologies from large
databases. To overcome the drawbacks of the existing system
for accessing related science information, M.S.P. Babu et al.
[8] proposed a new framework for automatic generation of
ontology and RDF for real-time web data. Tahani Alsubait
et al. [9] developed an e-learning suite with a set of
questions designed using ontological representation.
A.H.M. Rupasingha et al. [10] suggested that the performance
of ontology generation always depends on the
specificity of the terms. Seongwook Youn et al. [11] discussed
the pros and cons of tools such as Protégé 2000, OilEd, Apollo,
OntoLingua, OntoEdit, WebODE, KAON, ICOM, DEO and
WebOnto that are used for ontology creation. Kgotatso Desmond
Mogotlane et al. [12] presented a comparative study of plug-ins
of the Protégé tool such as DB2OWL and DataMaster.

III. PROPOSED WORK


Semantic web capabilities like RDF and ontology are applied
to enrich the knowledge. The proposed work is an
implementation of the framework proposed by the authors [8].
The framework is designed with reference to the semantic web
stack. It is carried out in two phases, namely the data extraction
phase and the data representation phase. In the data extraction
phase, web scraping is performed using an HTML parsing
technique, with a sample search query given as input to multiple
repositories. DOM parsing and HTML parsing techniques are
applied to validate the data retrieved from the repositories
by considering only the document section of the webpage.
Extensible Markup Language (XML) is the base for
semantic web representations; the validated information is
converted into semi-structured notation by using an XSD
declaration from the DOM tree and passed as input to the next
layers of the proposed framework. The XML notation is given as
input to the data representation phase. RDF notation is
generated and represented in graphical form using the Graphviz
tool. A textual representation of the RDF graph is provided using
Turtle, the Terse RDF Triple Language. The improved inverted
indexing technique is applied for ontological representation of
words, excluding the stop words.

Figure 1: Proposed framework

IV. METHODOLOGY

Implementation of the framework proposed in Section III is
carried out in two phases, namely the data extraction and data
representation phases. The details are given below.

Phase 1: Data extraction

The data extraction phase performs web scraping from multiple
repositories and stores the scraped data in a database. The
phase is subdivided into three steps, namely
web scraping, data validation and XML conversion. The scraped
data is validated by removing the unwanted data and
keeping only the document section of the web page. After
validation, the data stored in table format in the database
is converted into semi-structured notation (i.e. XML
notation) and passed as input to the data representation
phase.

Step 1: Web Scraping

Web scraping, also referred to as screen scraping or Web
harvesting, is used to fetch and extract data from a web
document using HTML parsing techniques. Here Web pages
are crawled and the content of each Web page is extracted. The
data in the Web page comprises three sections, namely the Web
page statistics bar, the document section and the descriptive section.
The three items are stored as three different
attributes in a database table. The content of a Web page
scraped using the HTML parsing technique is shown in Fig 2:


GOOGLE

[ SEARCH KEYWORD ]  [ SEARCH ]

WEB STATISTICS BAR

DOCUMENT SECTION

<H2> Data Mining - Investopedia </H2>
<a href> https://www.investopedia.com/terms/d/datamining.asp </a>
<SPAN> Data mining is the process of discovering patterns in large data sets </SPAN>

DESCRIPTIVE SECTION

<H2> Data mining - Wikipedia </H2>
<a href> https://en.wikipedia.org/wiki/Data_mining </a>
<SPAN> Data mining is the process of discovering patterns in large data sets </SPAN>

Figure 2: Content of the Web page 
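
The extraction of title, URL and snippet from markup like that in Fig 2 can be illustrated with Python's standard `html.parser` module. This is only a sketch: the paper's implementation is in PHP, and the class name `ResultParser`, the tag layout of the sample page and the record fields are assumptions made for the example.

```python
from html.parser import HTMLParser


class ResultParser(HTMLParser):
    """Collect (title, url, snippet) records from <h2>, <a href> and <span> tags."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record currently being filled
        self._field = None    # field that incoming text belongs to

    def handle_starttag(self, tag, attrs):
        if tag == "h2":  # each <h2> starts a new result entry
            self._current = {"title": "", "url": "", "snippet": ""}
            self.records.append(self._current)
            self._field = "title"
        elif tag == "a" and self._current is not None:
            self._current["url"] = dict(attrs).get("href") or ""
        elif tag == "span" and self._current is not None:
            self._field = "snippet"

    def handle_endtag(self, tag):
        if tag in ("h2", "span"):
            self._field = None

    def handle_data(self, data):
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()


page = """
<h2>Data Mining - Investopedia</h2>
<a href="https://www.investopedia.com/terms/d/datamining.asp">link</a>
<span>Data mining is the process of discovering patterns in large data sets</span>
"""
parser = ResultParser()
parser.feed(page)
parser.close()
records = parser.records
```

Each record could then be stored as one row of the database table described above.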
Step 2: Data Validation

In the data validation step, the data collected in step 1 is
validated using HTML and DOM parsing techniques. Here
unwanted data is removed and the necessary portion of the URLs
is retained. In this step the web statistics bar and descriptive
sections are removed from the database table, and the validated
data is stored in the database. The document section displays the
results in the form of page title, URL and snippet (description)
for the given search query. The descriptive section provides the
Wikipedia information about the input query. The data validation
process is carried out by considering only the document section
of the web page, as shown in Fig 3.

Figure 3: Content of the Web page after Data validation
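
A minimal Python sketch of the validation step is given below. The row layout (a `section` field plus title, URL and snippet) is hypothetical, and since the text does not say which portion of a URL is retained, the sketch assumes that the scheme, host and path are kept while the query string and fragment are dropped.

```python
from urllib.parse import urlsplit


def validate(rows):
    """Keep only document-section rows and reduce each URL to scheme, host and path."""
    cleaned = []
    for row in rows:
        if row["section"] != "document":
            continue  # drop statistics-bar and descriptive-section rows
        parts = urlsplit(row["url"])
        if not parts.netloc:
            continue  # assumption: rows without a usable URL count as unwanted data
        cleaned.append({
            "title": row["title"].strip(),
            "url": f"{parts.scheme}://{parts.netloc}{parts.path}",
            "snippet": row["snippet"].strip(),
        })
    return cleaned


rows = [
    {"section": "statistics", "title": "", "url": "", "snippet": "About 1,000 results"},
    {"section": "document", "title": " Data Mining - Investopedia ",
     "url": "https://www.investopedia.com/terms/d/datamining.asp?ref=search#top",
     "snippet": "Data mining is the process of discovering patterns in large data sets"},
    {"section": "descriptive", "title": "Data mining - Wikipedia",
     "url": "https://en.wikipedia.org/wiki/Data_mining", "snippet": "Data mining is ..."},
]
valid_rows = validate(rows)
```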

Step 3: XML Conversion

In the XML conversion step, the data validated in step 2 is converted
into a DOM tree using an XML Schema Definition (XSD). The
conversion is performed on the validated data stored in the
database by mapping each individual field/attribute to a
namespace convention. The XSD declaration of the DOM tree
has a hierarchical structure with a root node representing
the search keyword and three child nodes representing Title,
URL and Description respectively. An example XSD declaration
of a DOM tree is shown in Fig 4.

The XML produced in this step is later represented in
RDF syntax using Turtle notation, following the convention
specified in "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
as shown in Fig 6. Each triple is represented in
<subject, predicate, object> format by exploring the
relationship among the nodes [14].


<?xml version="1.0" encoding="UTF-8"?>
<bookstore>
  <book category="cooking">
    <title lang="en">Indian Cooking</title>
    <author>Sanjeev Kapoor</author>
    <year>2012</year>
    <price>80.00</price>
  </book>
  <book category="children">
    <title lang="en">Fair Tales</title>
    <author>J K. Rowling</author>
    <year>2010</year>
    <price>50.99</price>
  </book>
</bookstore>

Root element: <bookstore>
  Child element: <book> (parent: <bookstore>)
    Sibling elements: <title>, <author>, <year>, <price>


Figure 4: XSD Declaration of DOM tree 
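
The tree structure described for step 3 (a root for the search keyword with Title, URL and Description children) can be sketched with Python's standard `xml.etree.ElementTree` module. The element names `result`, `title`, `url` and `description` and the function `record_to_xml` are hypothetical; the paper does not list the element names of its schema.

```python
import xml.etree.ElementTree as ET


def record_to_xml(keyword, title, url, description):
    """Build a small tree: a root for the search keyword with three child nodes."""
    root = ET.Element("result", {"keyword": keyword})
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "url").text = url
    ET.SubElement(root, "description").text = description
    return root


root = record_to_xml(
    "data mining",
    "Data Mining - Investopedia",
    "https://www.investopedia.com/terms/d/datamining.asp",
    "Data mining is the process of discovering patterns in large data sets",
)
xml_text = ET.tostring(root, encoding="unicode")
```

`ET.tostring` escapes characters such as `&` and `<` in the text values, so scraped snippets can be passed in unchanged.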


Phase 2: Data representation


Data extracted in steps 1, 2 and 3 is maintained in XML
format and is given as input to the data representation phase. In
the Semantic Web architecture, data representation mainly
involves RDF-ization and ontology generation.
Hence the data representation phase is subdivided into two
steps, namely RDF-ization and ontology generation, which are
explained in detail in steps 4 and 5 respectively.

Step 4: RDF-ization 

The Resource Description Framework (RDF) is the basic
building block of the semantic web, promoting conceptual
modeling of web data [13]. The RDF-ization process is carried
out using Turtle notation and the Graphviz tool. In this step the
XML notation data stored in the extraction phase is given as
input to RDF-ization. The RDF notation is visualized in the
form of an RDF graph using the Graphviz tool. Each tuple in a
relational database is decomposed into RDF triples:
decomposition of a tuple creates a new blank node
corresponding to the row, and a new triple set is obtained. The
title is taken as the subject, the URL as the predicate and the
description as the object. A node can be a URI reference, a
literal or a blank node. The graph in Fig 5 is an example of the
RDF-ization process of a semantic net.



Figure 5: RDF Triple 


<?xml version="1.0" encoding="UTF-8"?>                        "XML Declaration"
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rbss="www.semanticweb.org/rbss/#">             "RDF Root element"
  <rdf:Description rdf:about="www.semanticweb.org/rbss/hi">   "Description of Resource identified"
    <rbss:keyword>hi</rbss:keyword>                           "Resource Property"
    <rbss:title>Hi | Define Hi at Dictionary.com</rbss:title>
    <rbss:url>http://www.google.com/search?q=hi</rbss:url>
    <rbss:description>Hawaii (state) 2. Hawaiian Islands. Word of greeting. An American English,
    originally to attract attention (15c.), probably a variant of Middle English hy, hey (late 15c.) also
    an exclamation to call attention.</rbss:description>
  </rdf:Description>
</rdf:RDF>

Figure 6: RDF-ization

The <rdf:Description> element provides the description of the
resource identified by the rdf:about attribute. The tags
<rbss:title>, <rbss:keyword> and <rbss:url> are the properties of
the resource identified. The RDF represented in Turtle notation is
visualized in a graphical format using the Graphviz tool, which is
open source software for generating graphs.
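
As an illustration of the Turtle form, the resource of Fig 6 can be serialized with a few lines of Python. This is a hand-written sketch that follows the resource-with-properties shape of Fig 6 and does not use any RDF library; the functions `turtle_escape` and `to_turtle` are hypothetical, and the `http://` scheme added to the `rbss` namespace is an assumption, since the figure gives the namespace without a scheme.

```python
def turtle_escape(text):
    """Escape the characters that may not appear raw in a Turtle string literal."""
    return (text.replace("\\", "\\\\").replace('"', '\\"')
                .replace("\n", "\\n").replace("\r", "\\r"))


def to_turtle(subject_iri, properties, prefix="rbss",
              ns="http://www.semanticweb.org/rbss/#"):
    """Serialize one resource and its literal-valued properties as Turtle."""
    if not properties:
        raise ValueError("at least one property is required")
    lines = [f"@prefix {prefix}: <{ns}> .", "", f"<{subject_iri}>"]
    items = list(properties.items())
    for i, (name, value) in enumerate(items):
        end = " ." if i == len(items) - 1 else " ;"  # last triple closes the statement
        lines.append(f'    {prefix}:{name} "{turtle_escape(value)}"{end}')
    return "\n".join(lines)


turtle_text = to_turtle(
    "http://www.semanticweb.org/rbss/hi",
    {
        "keyword": "hi",
        "title": "Hi | Define Hi at Dictionary.com",
        "url": "http://www.google.com/search?q=hi",
    },
)
```

The resulting text lists the subject once, followed by one `rbss:` predicate and literal object per line.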

Step 5: Ontology Generation 

Ontology is defined as a formal specification of a
conceptualization of the domain of interest. In the ontology
generation step, the RDF notation obtained from step 4 is used
to create a vocabulary of words using the improved inverted
indexing algorithm. The algorithm is
applied to real-time web data collected from multiple
repositories and to text documents. The words from the
description tags are extracted, excluding the stop words, and the
frequency count/term frequency (TF) of each word is
maintained. The improved inverted indexing
algorithm is presented as follows:

Algorithm: Improved inverted indexing

Input: Database D = {T1, T2, ..., Tn}, Storage Database
Output: Attributes {A1, A2, ..., An}, where Ai, for i = 1, 2, ..., n,
represent the ontology vocabulary.

Parameters: Swrd_k = array of stop words
attsL_q = snippet attribute
Words_k = words stemmed from the snippet attribute
attsL_f = word frequencies after stemming
attsL = ontology along with the frequency counts

1. Swrd_k = {a};
2. for i = 0; i <= i+8, i < D do
3.     attsL_q = Query_Coverage(D, i);
4.     Words_k = Words_Separate(attsL_q, Swrd_k);
5.     attsL_f = Words_Usage_frequency(D, attsL_q, Words_k);
6.     attsL = attsL_q ∪ attsL_f
7.     f = highest_freq(attsL)
8.     if (f < freq(attsL)) then
9.         sort(Words_k, freq(attsL))
10.    end if
11. end for
12. return (Words_k, freq(attsL))
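
The core idea of the algorithm (collect the words of each snippet, discard stop words and rank the rest by frequency) can be sketched in Python as follows. This is a simplified illustration rather than the authors' algorithm: no stemming is performed, punctuation is stripped, the stop-word list is a small illustrative sample, and the name `term_frequencies` is hypothetical.

```python
import re
from collections import Counter

STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}  # illustrative list only


def term_frequencies(snippets, stop_words=STOP_WORDS):
    """Count each non-stop word across all snippets and record which snippets
    contain it (the inverted index)."""
    counts = Counter()
    postings = {}
    for doc_id, text in enumerate(snippets):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            if word in stop_words:
                continue
            counts[word] += 1
            postings.setdefault(word, set()).add(doc_id)
    ranked = counts.most_common()  # highest frequency first
    return ranked, postings


snippets = [
    "Web mining is the use of data mining techniques",
    "Data mining is the process of discovering patterns in large data sets",
]
ranked, postings = term_frequencies(snippets)
```

The ranked list plays the role of the word and frequency vocabulary, and `postings` links every word back to the snippets in which it occurs.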

V. RESULTS AND DISCUSSION 

The Semantic Web stack proposed by Tim Berners-Lee [15] is
implemented using the framework proposed by the authors
in Section III. It is implemented in PHP version 5 (an open
source scripting language) and MySQL version 5 (an
open-source relational database management system).
It is tested on a test dataset comprising sample
search keywords. Response time is the amount of time that
elapses from the receipt of the query until the results are
displayed to the user. Response time can be measured on the
server side or the client side, as shown in Fig 7.


Figure 7: Response Time

Throughput is defined as the number of queries executed per
second (qps). Throughput and response time are observed for
the set of retrieval operations with respect to the page load
times. The performance of the implemented framework with
respect to throughput is shown in Fig 8.

Figure 8: Throughput


A sample search query is given as input, and the web scraping
results are shown in Fig 9:

[Screenshot of the Rule Based Semantic Search System showing results from Google, Yahoo! and Bing]

Figure 9: Web scraping results

Web scraping performance is evaluated by considering the
database size and the count of URLs
extracted, as shown in Fig 10:

Factor          Result
Database Size   96
URL's Count     446

Figure 10: Web scraping analysis

Scraped data from multiple repositories is given as input to the
data validation step. The validated data is obtained as the
output of the data validation process by applying HTML and
DOM parsing techniques. Data validation considers only the
document section of a web page. Data validation results are
shown in Fig 11:



Figure 11: Data validation results 

The performance of the data validation process, which separates
wanted from unwanted data in the scraped data, is measured by
database size and count of URLs and is
shown in Fig 12:

Factor          Before Validation   After Validation
Database Size   96                  76
URL's Count     446                 332

Figure 12: Data Validation

The validated data stored in the database is converted into
XML notation by applying the XSD declaration, as shown in
Fig 13:

[Screenshot of the Rule Based Semantic Search System showing the XML notation generated for results from Google, Yahoo! and Bing]

Figure 13: XML Conversion

Resource Description Framework (RDF) is a recommended
standard of the World Wide Web Consortium (W3C) [16]. The RDF
representation of data in Turtle form is shown in Fig 14:

Data Representation converted into RDF Notation

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rbss="www.semanticweb.org/rbss/#">
<rdf:Description rdf:about="www.semanticweb.org/rbss/semantic web expert system">
  <rbss:keyword>semantic web expert system</rbss:keyword>
  <rbss:title>Towards the Semantic Web Expert System - ortus</rbss:title>
  <rbss:url>http://www.google.com/search?q=semantic+web+expert+system</rbss:url>
  <rbss:description></rbss:description>
  <rbss:date>2017-12-07 06:39:36</rbss:date>
</rdf:Description>
<rdf:Description rdf:about="www.semanticweb.org/rbss/semantic web expert system">
  <rbss:keyword>semantic web expert system</rbss:keyword>
  <rbss:title>Ontology Merging in the Context of a Semantic Web Expert System ...</rbss:title>
  <rbss:url>http://www.google.com/search?q=semantic+web+expert+system</rbss:url>
  <rbss:description></rbss:description>
  <rbss:date>2017-12-07 06:39:37</rbss:date>
</rdf:Description>

Figure 14: RDF-ization results Turtle form

RDF generation is illustrated for a sample relation named
"testrdf", which has the attributes <name, description, freq>. The
relation name "testrdf" is considered as a class
in the RDF graph, and the sets of three nodes connected
to testrdf in a depth-wise manner represent the tuples of the
relation, as shown in Fig 15:

Name       Description         Frequency
christee   she is a lady       1
Ganesha    he is a god         2
mahesh     he is a superstar   3

Figure 15: RDF Graph generation

Ontology generation is performed for the data obtained from
multiple repositories as well as for a text file. The improved
inverted indexing technique is applied to extract the words with
their frequencies, discarding the stop words, in the order of
highest precedence. The result of ontology generation for
real-time web data with frequencies is shown in Fig 16:

[Table of scraped results with columns Title, URL and Description for Web mining pages; entries truncated]

ID   Word           Frequency
1    web            23
2    mining         22
3    data           16
4    information    12
5    projects       12
6    mining,        11
7    techniques     3
8    discover       6
9    large          5
10   patterns       5
11   identic        5
12   engineering.   4

Figure 16: Ontology Generation for Real time web data

The highest-frequency word is considered as a frequent search
term for the purpose of rule framing using description logic.
The rule mapping for efficient retrieval is left as
future work. The result of ontology generation
for a text document is shown in Fig 17.

Rule Based Semantic Search Engine Abstract: Now
a days, the trending research diversions is made
towards Artificial intelligence, Social Media and
Machine learning, where the areas like machine
learning and Artificial Intelligence focus on self
decision schemes basing on input fed to it which can
be either the output generated from it or the new
input given by the user.

Word       Frequency
Rule       1
Based      1
Semantic   1
Abstract   1
Now        1
days,      1
trending   1

Figure 17: Ontology Generation for text document


VI. CONCLUSION 

The evolution of the web has taken many forms, namely web 1.0,
web 2.0, web 3.0 and web 4.0, which led to high-end
information retrieval systems using the semantic web. The existing
traditional system, which collects data from search engines,
exhibits average retrieval performance. Implementation of the
proposed framework for automatic generation of ontologies
and RDF improves the performance of traditional search
engines by incorporating semantic capabilities. It includes the
application of HTML parsing and DOM parsing
techniques, Turtle notation and the Graphviz tool. The algorithm
improves information retrieval in Semantic Web and Expert
Systems. Future work includes applying efficient
cryptography for securing the database and rule framing for the
design of an expert system.
Abstract: In this research, we broaden the advantages of nonnegative garrote as a feature selection method and empirically
show that it provides results comparable to panel models. We compare nonnegative garrote to other variable selection methods such as
ridge, lasso and adaptive lasso and analyze their performance on a dataset that we have previously analyzed in another
study. We show that the results from nonnegative garrote are comparable to the robustness checks applied to
the panel models to validate statistically significant variables. We conclude that nonnegative garrote is a robust variable
selection method for panel orthonormal data as it accounts for the fixed and random effects, which are present in panel datasets.

Keywords: nonnegative garrote, feature selection, panel models, fixed effects 

A. Introduction 

A large number of variables and observations characterizes big datasets. They are ‘big’ in dimension and length.
The challenge before researchers is to extract all valuable information from big data for statistical inference.
Researchers can derive valuable information by identifying the significant variables and/or reducing the size of the
sample. In data mining theory, these methods are known as dimensionality reduction.

Dimensionality reduction results in reducing the size of a dataset so that the researcher can extract all significant 
information for research. It includes two techniques - feature selection and feature extraction. Feature extraction 
transforms high-dimensional data into a space of fewer dimensions, which contain all significant characteristics of 
data for statistical modelling. The data transformation may be linear, as in principal component analysis (PCA), 
linear discriminant analysis (LDA) or nonlinear, as in non-linear LDA, kernel PCA. Feature selection, in contrast, 
reduces the number of variables by choosing those, which are statistically significant. As feature selection excludes 
variables from a statistical model, the reduction of dimensionality is performed by variable selection. What type of 
dimensionality reduction technique will be used depends on the data type, researcher’s goals and the quality of data. 

In this research, we compare the performance of variable selection methods such as ridge, lasso,
adaptive lasso and nonnegative garrote. Although these methods are popular in the academic literature, we test their
performance on panel data to show empirically that nonnegative garrote outperforms other feature selection methods
in panel data. As the nonnegative garrote is the only variable selection method that accounts for fixed and random
regressors, we show empirically that nonnegative garrote provides robust variable selection in datasets without
multicollinearity. Thus, we broaden the set of advantages of nonnegative garrote. The next section reviews the academic
literature. Sections C and D present the theoretical and empirical framework. Section E concludes.

B. Literature overview 

For many years, one of the most popular methods for variable selection has been the selection criterion of Mallows.
During the 1960s, Mallows’ Cp was a new and modern tool for selecting significant variables in regression problems.
As linear regression with k candidate variables suggested a pool of 2^k possible models, of which only a few would provide a good enough fit to the data,
a criterion for choosing the best model was necessary. Mallows [4] invented Mallows’ Cp as a criterion for rejecting
unfit models and retaining those that fit the data best, on the premise that the best model has the smallest prediction
error. Gorman and Toman [3] further explored the advantages of the selection criterion using visualization tools. The
best fit among the 2^k models would be selected based on the combination of variables that provided the smallest
prediction error. Mallows’ criterion appeared to be simple to interpret visually and numerically.

Due to the computational burden and other disadvantages of Mallows’ criterion [5], researchers continued to look
for other methods to select statistically significant variables in linear regression problems. The Russian
mathematician Tikhonov, who worked on mathematical applications in physics, invented ridge regression as a
solution to ill-posed problems. In 1962, Hoerl [2] applied ridge regression as a variable selection method in statistics
for the first time. Unlike Mallows’ Cp criterion, ridge regression imposes a penalty on the regression coefficients. Some
of the coefficients shrink to a number very close to zero, which filters out the variables with insignificant impact on
the model. Hoerl and Kennard [6, 7, 8] defined the ridge trace and its applications in orthogonal and
nonorthogonal problems. How to choose the optimal value of the ridge penalty
parameter remained unsolved. In 1979, generalized cross-validation (GCV) was proposed as a method for selecting the ridge parameter
[9]. In modern research articles, ridge regression is applied with GCV to various statistical problems. Although it
does not shrink regression coefficients exactly to zero, it is used along with other variable selection methods for model
validation. The use of ridge regression is broader, which justifies its application nowadays in data analysis [10].

Ridge regression as a feature selection method is often compared to lasso regression [11]. Lasso was devised in
1996 [11] as a method which filters significant variables. Similar to ridge, lasso imposes a penalty on the regression
coefficients but, unlike ridge, shrinks some coefficients exactly to zero. In 2006, Efron and Hastie [19] devised least angle
regression (lars) to perform shrinkage. The advantages of least angle regression range from fast computation to being easily used to provide
solution paths for other methods like lasso. Many software packages, for example, provide solutions for lasso
regression based on lars paths [20].

Lasso [16] is based on another shrinkage method introduced a year earlier, the nonnegative garrote [12].
Nonnegative garrote has an advantage over lasso as it can provide robust results in small and big datasets. The
difference among ridge, lasso and nonnegative garrote lies in the type of constraint on the parameters.
Depending on the constraint, the methods yield different results. Although ridge and lasso use cross validation to
select the value for the tuning parameter [9 and 13], the nonnegative garrote can use both cross validation and
bootstrapping depending on the data type [12]. The two methods and their adaptations are widely used in various
topics [13 and 14].

Zou [15] criticized the lasso approach for yielding a non-robust solution in big datasets. He described its
disadvantages in detail in [15] and proposed an oracle method called adaptive lasso. Adaptive lasso overcomes some
disadvantages of lasso by weighting the variables in a dataset. He theoretically justified his hypothesis that adaptive
lasso performs more reliable variable selection than lasso. The properties of adaptive lasso allow variable selection in
big datasets with heteroscedasticity [17] and the choice of generalized empirical likelihood estimators [19].

In 2005, Zou and Hastie introduced a hybrid between ridge and lasso called elastic net regularization [21]. The
elastic net combines the penalty terms of lasso and ridge and outperforms lasso. The elastic net can be applied in
metric learning [22] and portfolio optimization [23]. Similar to lasso [24], the elastic net can be reduced to a support
vector machine model for classification problems [25 and 26].

The evolution of variable selection methods has led to adaptations of these methods depending on the research
goal and the quality of the dataset. In the next section, we provide the theoretical framework behind ridge,
nonnegative garrote, lasso and adaptive lasso as they appear in the original research articles.


C. Methodological Framework 

In this section, we present the theoretical framework behind the shrinkage methods. They all minimize
an objective subject to a constraint on the coefficients. The shrinkage methods differ only in the type of the
constraint.

Ridge regression [11]:

$$\hat{\beta}^{ridge} = \arg\min_{\beta} \Big\| y - \sum_{j=1}^{p} x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \qquad (1)$$


As the penalty parameter λ goes towards zero, the ridge coefficients coincide with the coefficients of the OLS regression. As λ grows
towards infinity, the ridge coefficients shrink towards zero.

A special case of the estimator for an orthonormal matrix is

$$\hat{\beta}_j^{ridge} = \frac{\hat{\beta}_j^{ols}}{1 + \lambda} \qquad (2)$$
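
The orthonormal special case follows in one step from the ridge objective. The short derivation below is a sketch that writes the regressors as a matrix X with columns x_j and assumes X^T X = I; the matrix notation is introduced here for illustration only.

```latex
% First-order condition of the ridge objective \|y - X\beta\|^2 + \lambda \|\beta\|^2:
-2 X^{\top} (y - X\beta) + 2 \lambda \beta = 0
\quad\Longrightarrow\quad
(X^{\top} X + \lambda I)\, \beta = X^{\top} y .

% Orthonormal design, X^{\top} X = I, so that \hat{\beta}^{ols} = X^{\top} y:
\hat{\beta}^{ridge} = \frac{X^{\top} y}{1 + \lambda} = \frac{\hat{\beta}^{ols}}{1 + \lambda} .
```

Every OLS coefficient is divided by the same factor 1 + λ, which is why ridge pulls coefficients towards zero without setting any of them exactly to zero for finite λ.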


Equation 2 shows that with the shrinkage of coefficients, variance is reduced but bias is introduced. Ridge
regression rarely shrinks regression coefficients to zero. Instead, it shrinks OLS coefficients to values close to zero.

Lasso regression [11]:

$$\hat{\beta}^{lasso} = \arg\min_{\beta} \Big\| y - \sum_{j=1}^{p} x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} |\beta_j| \qquad (3)$$


The lasso, on the other hand, shrinks some coefficients to zero, thus eliminating the statistically insignificant
variables. The penalty λ Σ|β_j| is called the L1 penalty. The penalty parameter λ determines the amount of
shrinkage. When λ = 0, no shrinkage is performed and the lasso parameters equal the estimates of the OLS
regression. With the increase of the value of λ, more parameters are excluded from the regression. When λ = ∞,
theoretically all coefficients are removed.

A disadvantage of the least absolute shrinkage and selection operator (lasso) is its inability to perform robust variable selection in
datasets with highly correlated variables. To solve this problem, Zou and Hastie [21] devised the elastic net,
another regularization method, which combines the penalty terms from ridge and lasso. The interested reader can consult their
article "Regularization and Variable Selection via the Elastic Net" for further details on the elastic net.

Despite the advantage of lasso over ridge in terms of variable shrinkage, lasso is robust only in very big datasets.
It fails to provide robust feature selection in smaller datasets. Adaptive lasso was devised [15] to overcome this
disadvantage of lasso.

Adaptive lasso [15]:

$$\hat{\beta}^{adaptive} = \arg\min_{\beta} \Big\| y - \sum_{j=1}^{p} x_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} w_j |\beta_j| \qquad (4)$$

The adaptive lasso estimator has a similar structure to lasso; the difference is that each coefficient has a weight in
the penalty term. The tuning parameter behaves as λ does in lasso. By weighting the coefficients, statisticians avoid
spurious elimination of variables. Adaptive lasso can result in robust feature selection even in smaller datasets.

Choosing the tuning parameter in a feature selection method is another important issue. We have used k-fold
cross validation, the most widely used approach, as the underlying method for choosing the optimal value of the tuning parameter in ridge and lasso.
K-fold cross validation draws K different samples from a dataset and
compares the models’ errors. The goal is to choose the value of the tuning parameter which minimizes the CV
error. The k-fold CV method is given by equation 5:

$$CV(\lambda) = \frac{1}{K} \sum_{k=1}^{K} E_k(\lambda) \qquad (5)$$

where K is the number of samples drawn from the dataset, k varies from one to K, and E_k(λ) is the error of the model on the k-th sample.
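
The selection rule in equation 5 can be sketched in a few lines of Python. To stay dependency-free, the sketch assumes the simplest possible setting: a single predictor, no intercept, and the closed-form ridge slope. The function names, the fold assignment and the synthetic data are illustrative assumptions, not the setup used in this research.

```python
import random


def ridge_1d(xs, ys, lam):
    """Closed-form ridge slope for one predictor and no intercept."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxy / (sxx + lam)


def kfold_cv_error(xs, ys, lam, K=5, seed=0):
    """Equation (5): the average of the K held-out errors E_k(lambda)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::K] for k in range(K)]  # K disjoint held-out samples
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        beta = ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        # E_k(lambda): mean squared error on the k-th held-out sample.
        mse = sum((ys[i] - beta * xs[i]) ** 2 for i in held_out) / len(held_out)
        errors.append(mse)
    return sum(errors) / K


def select_lambda(xs, ys, grid, K=5, seed=0):
    """Return the grid value with the smallest cross-validation error."""
    return min(grid, key=lambda lam: kfold_cv_error(xs, ys, lam, K, seed))


# Synthetic data (an assumption for the example): y = 2x + noise.
rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(40)]
ys = [2.0 * x + rng.gauss(0.0, 0.5) for x in xs]
print(select_lambda(xs, ys, [0.0, 0.1, 1.0, 10.0]))
```

Using the same seed for every candidate value keeps the folds identical across the grid, so the errors being compared differ only through the tuning parameter.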

K-fold CV can be used for determination of the tuning parameter in the nonnegative garrote (eq. 6). However, 
nonnegative garrote uses k-fold CV when the assumption for the random independent variables is fulfilled. The 
nonnegative garrote can select significant variables in small and big datasets. 

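Nonnegative garrote [14, 29]:

$$\arg\min_{d} \frac{1}{2} \| Y - Zd \|^2 + n\lambda \sum_{j=1}^{p} d_j \qquad (6)$$

where d_j ≥ 0 for all j and

$$Z = (Z_1, \ldots, Z_p), \quad Z_j = X_j \hat{\beta}_j^{LS}$$

with $\hat{\beta}^{LS}$ being the least squares estimate and λ the tuning parameter. The nonnegative garrote estimate of the regression
coefficient is given by

$$\hat{\beta}_j^{NG}(\lambda) = d_j(\lambda)\, \hat{\beta}_j^{LS}, \quad j = 1, \ldots, p \qquad (7)$$

Under orthogonal designs the nonnegative garrote estimator can be expressed by

$$d_j(\lambda) = \left( 1 - \frac{\lambda}{(\hat{\beta}_j^{LS})^2} \right)_{+}, \quad j = 1, \ldots, p \qquad (8)$$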

The shrinkage factor, as a result, will be close to one if the least square estimator is large. If the least square 
estimator is small, the shrinkage factor can reduce to zero. 
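
The difference between the ridge and garrote shrinkage rules is easy to see numerically. The sketch below evaluates the orthonormal ridge rule (least squares coefficient divided by 1 + λ) and the orthogonal-design garrote factor (one minus λ over the squared least squares coefficient, truncated at zero) on three hypothetical least squares coefficients. The function names and the coefficient values are assumptions for illustration; the two rules use different scalings of λ, so only the qualitative pattern should be compared.

```python
def ridge_shrink(beta_ls, lam):
    """Orthonormal ridge rule: every coefficient is divided by (1 + lambda)."""
    return beta_ls / (1.0 + lam)


def garrote_factor(beta_ls, lam):
    """Garrote shrinkage factor under an orthogonal design, truncated at zero."""
    if beta_ls == 0.0:
        return 0.0  # a zero coefficient stays zero; avoids division by zero
    return max(0.0, 1.0 - lam / beta_ls ** 2)


def garrote_shrink(beta_ls, lam):
    """Garrote estimate: shrinkage factor times the least squares coefficient."""
    return garrote_factor(beta_ls, lam) * beta_ls


# Hypothetical least squares coefficients: large, moderate and small.
lam = 0.5
for beta_ls in (3.0, 1.0, 0.2):
    print(beta_ls, ridge_shrink(beta_ls, lam), garrote_shrink(beta_ls, lam))
```

The large coefficient keeps a garrote factor close to one, while the small coefficient is set exactly to zero; ridge shrinks all three by the same proportion and removes none of them.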

The problem of the optimal value of the tuning parameter in the nonnegative garrote case can be solved either
by k-fold cross validation or by the little bootstrap procedure [1]. K-fold CV in nonnegative garrote, unlike lasso and
ridge, is performed if the independent variables are assumed to be random and uncorrelated. Lasso and ridge lack such
an assumption. When nonnegative garrote assumes fixed independent variables, the little bootstrap procedure
described in [4] can select the tuning parameter. When the X variables are random, the selection of the best tuning
parameter can be performed by cross validation [1].

Nonnegative garrote differs from other variable selection methods like ridge and lasso in its ability to perform
robust variable selection under various assumptions for the X variables (random or fixed) [1]. This property of nonnegative
garrote is reminiscent of panel models, where the panel OLS accounts for fixed or random effects. The equation of the
panel models is:

$$Y_{it} = \beta f(x) + c_{it} + \varepsilon_{it} \qquad (9)$$
where c_it denotes the type of effects and ε_it is the error term. Although the academic literature has not
investigated the connection between panel models and nonnegative garrote as a panel variable selection method, we
believe there is such a connection. More specifically, we have studied a panel fixed effects model and
compared its results with variable selection via nonnegative garrote with little bootstrap, and conclude that nonnegative
garrote for fixed X variables successfully performs feature selection in panels. We compared the results of the
nonnegative garrote procedure to lasso, ridge and adaptive lasso and found that the latter fail in panel datasets.

Although we base our conclusion on one dataset and many experiments must be conducted, we believe this 
observation can be used to test the robustness of panel models and, on the other hand, it can be a time saving 
instrument for initial investigation of panel models. 

Scientists apply panel models to datasets that contain both time series and cross-sectional observations. The
problem with panel datasets is that the model should account for the effects of time and individual
characteristics. To do that, scientists perform tests (Hausman [30], Pesaran [31]) to examine whether time and/or
individual effects are present. They also investigate whether the data are connected by factors that are characteristics of
the dataset (fixed effects) or whether randomness is the underlying process in the dataset (random effects) [28]. As a result,
finding the right panel model is usually a time-consuming process, as it includes running a large number of models and
testing them for robustness.

We believe the similar results between panel models and the nonnegative garrote can be attributed to the 
similarities between panel fixed effects and little bootstrap in the X fixed case. In the X random case, the panel 
random effects can describe a procedure similar to cross validation. Breiman [1] describes the little bootstrap 
procedure as a method for choosing random samples with replacement from a dataset. The replacement corresponds 
with fixing the X variables to be a particular quantity [1]. The panel fixed effects models also assume X variables to 
be fixed quantities. Thus, the little bootstrap procedure generates random samples of fixed X and finds the tuning 
parameter of nonnegative garrote, which results in the smallest MSE in the fixed case. Then, the nonnegative garrote 
shrinks the coefficients of the fixed variables and selects only those, which are statistically significant. The fixed 
effects panel model, on the other hand, is based on the panel OLS method, which estimates the coefficients of X
variables with fixed quantities. The estimates of some coefficients can be statistically insignificant, which leads to their
elimination. Although nonnegative garrote with bootstrap and panel fixed effects models choose the
significant variables in different ways, both address the problem of fixed quantities in X variables and
should result in similar statistical significance.

A similar parallel can be drawn between random panel effects models and nonnegative garrote with cross validation.
Random effects assume that X quantities are random. The panel model estimates the coefficients based on a panel
OLS method with random independent variables. Some of the coefficients become statistically significant. Cross
validation in the nonnegative garrote procedure chooses random samples from the data, which, unlike those of the little
bootstrap, are independent. In this way, cross validation accounts for randomness in the data. Once cross validation
estimates the MSE of many random independent samples, the tuning parameter for the nonnegative garrote can be
selected and shrinkage can be performed. The statistically insignificant variables are shrunk to zero.

In the next section, we show our experiment on a panel dataset about property rights which contains fixed effects.
Although our research needs to be extended to more datasets, we believe that our results contribute to
the practical advantages of the nonnegative garrote.

D. Results 

We have carried out our experiment on two datasets: a property rights dataset, described in Table 1, and a
scoreboard of indicators for financial crisis in the EU, described in [30].

We analyzed dataset 1 in a study of the determinants of property rights in a panel of data [29]. Our analysis
shows that panel fixed effects are present in the dataset. We have outlined the main determinants of property
rights (model 1) and tested whether shadow economy (model 4), income (model 3) and gender inequality (model 2)
affect the index of property rights. Using the panel OLS method, we obtained the statistical significance of the variables
presented in Table 1:


Table 1: OLS Panel method results

Variable  | Model 1   | Model 2   | Model 3   | Model 4
dbirth    | -0.06**   | -0.06**   | -0.06**   | -0.05*
          | (-28.66)  | (-28.67)  | (-27.47)  | (-25.22)
drate     | -0.10**   | -0.10**   | -0.10**   | -0.11**
          | (-26.17)  | (-26.16)  | (-25.81)  | (-27.16)
mortality | -0.05***  | -0.05***  | -0.05***  | -0.05***
          | (-70.89)  | (-72.08)  | (-72.58)  | (-70.78)
unempl    | -0.02**   | -0.02**   | -0.02**   | -0.02**
          | (-27.96)  | (-27.96)  | (-25.91)  | (-28.54)
urban     | 0.01***   | 0.01***   | 0.01***   | 0.01***
          | (35.26)   | (34.75)   | (32.54)   | (35.84)
dmilitary | 0.26*     | 0.27*     | 0.27*     | 0.26*
          | (24.76)   | (24.78)   | (25.06)   | (24.08)
dgender   |           | -0.00     |           |
          |           | (-0.05)   |           |
lgini     |           |           | -0.09     |
          |           |           | (-0.51)   |
dshadow   |           |           |           | 0.07***
          |           |           |           | (40.37)
R2        | 0.67      | 0.67      | 0.67      | 0.68
F-stat    | 135.59*** | 115.93*** | 116.55*** | 118.45***
Data set  | 2000-2014 | 2000-2014 | 2000-2014 | 2000-2014
          | Annual    | Annual    | Annual    | Annual
N         | 420       | 420       | 420       | 419

Source: authors’ calculations


According to the results of our experiment, income and gender inequality are statistically insignificant. The panel 
models lack cross sectional dependencies and multicollinearity and robustness checks have been made [29]. 

We have also performed variable selection on the transformed dataset (Table 1) via lasso, ridge, adaptive lasso and
the nonnegative garrote for fixed X variables. Table 2 presents the results of the feature selection methods and
compares them with panel models (1-4):


Table 2: Feature selection methods

Variable    | Lasso | Ridge | Adaptive Lasso | NNG   | Panel models significance
dbirth      | -0.05 | -0.05 | 0              | 0.00  | **/*
drate       | -0.11 | -0.13 | -0.03          | -0.05 | **
mortality   | -0.04 | -0.04 | -0.04          | -0.48 | ***
unempl      | -0.02 | -0.02 | -0.02          | -0.22 | **
dshadow     | 0.04  | 0.04  | 0.01           | 0.01  | ***
dgender     | 0     | 0.01  | 0              | 0.00  |
infl        | 0     | 0     | 0              | 0.00  |
loginternet | -0.02 | -0.01 | 0              | 0.00  |
dexpect     | -0.05 | -0.07 | 0              | 0.00  |
emissions   | 0.01  | 0.01  | 0.01           | 0.00  |
dhealth     | 0     | 0     | 0              | 0.00  |
urban       | 0.01  | 0.01  | 0.01           | 0.33  | ***
lgini       | -0.07 | -0.11 | 0              | 0.00  |
dmilitary   | 0.2   | 0     | 0.13           | 0.08  | *

Source: authors’ calculations


When we compare the results of the feature selection methods with the results of the panel models we see that 
lasso identifies 11 non-zero variables against seven significant variables from the panel models. Zou and Hastie [15] 
criticize lasso estimation for providing non-robust estimates in relatively small datasets and highly correlated 
variables. Although the variables lack multicollinearity and autocorrelation, the number of observations in the dataset 
is relatively small, so lasso fails to perform robust shrinkage of all insignificant variables. Another reason why lasso 
fails in this dataset is the presence of panel fixed effects, which are absent in cross-sectional and time series datasets. 
Lasso does not account for panel effects, fixed or random. Lasso was designed with the purpose to perform shrinkage 
of coefficients to zero in large datasets, most often applied on cross sectional data [11]. As a result, fixed and random 
effects in panels were not included in the theoretical framework of lasso and lasso fails to provide robust shrinkage in 
panel data. 

The disadvantages of lasso encouraged Zou and Hastie [15] to propose a weighted version of lasso, which is
robust in the presence of multicollinearity and a smaller number of observations. They called it adaptive lasso. In Table
2, we see that the adaptive lasso outlines seven variables with non-zero coefficients, and their number coincides with
the number of significant variables in the panel models. However, a more detailed analysis shows that the non-zero
variables from the adaptive lasso are different from the statistically significant variables in the panel models.
According to the adaptive lasso, the birth rate is not significant and emissions of carbon dioxide are significant,
which differs from the panel models. Similar to lasso, adaptive lasso does not account for panel effects. Ridge regression
[2] results in estimates similar to lasso, with more nonzero coefficients, as it does not shrink to zero. Of all three
methods, ridge is the least robust, as it does not perform variable selection in panel data and brings no valuable
information about the dataset.

Bearing these results in mind, we wanted to examine whether nonnegative garrote with little bootstrap can provide
results competitive with panel models. As Leo Breiman [1] introduced the advantages of nonnegative garrote as a
feature selection method, we raised the question of whether it can be robust in panel data. Breiman [1] proposed
nonnegative garrote as an alternative to best subset selection and ridge. Later, Tibshirani and Hastie [11]
developed lasso based on the nonnegative garrote of Breiman. According to Breiman, nonnegative garrote performs
robust variable selection not only in bigger datasets like lasso, but also in smaller ones. However, it fails when the
dataset has multicollinearity and outliers. As our dataset lacks outliers and multicollinearity and has fixed effects, we
ran nonnegative garrote with little bootstrap on our dataset.

As the nonnegative garrote is the only variable selection method which proposes two selection procedures for the
tuning parameter based on the assumptions for the X variables, it is reasonable to test it on panel datasets. As Table
2 shows, the nonnegative garrote selects the same variables as the panel models, with the exception of the birth rate. Although the coefficients have
different values in nonnegative garrote and panels, the same variables appear to be statistically significant. According
to the nonnegative garrote, the birth rate has a zero coefficient, so it is statistically insignificant, but in models 1-3 it is
significant at the 1% significance level and in model 4 the significance level is 10%, which shows the effects of birth
rate may be negligible. Thus, this result is comparable with the result of the nonnegative garrote.
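
The comparison described above can be reproduced mechanically from Table 2. The sketch below transcribes the printed coefficients, treats a printed zero as "not selected", and reports where each method departs from the set of variables that are significant in the panel models. The helper names are illustrative assumptions; because the table is rounded to two decimals, a printed zero only means zero at that precision.

```python
# Coefficients transcribed from Table 2, in the column order
# lasso, ridge, adaptive lasso, NNG.
TABLE2 = {
    "dbirth":      (-0.05, -0.05, 0.0, 0.0),
    "drate":       (-0.11, -0.13, -0.03, -0.05),
    "mortality":   (-0.04, -0.04, -0.04, -0.48),
    "unempl":      (-0.02, -0.02, -0.02, -0.22),
    "dshadow":     (0.04, 0.04, 0.01, 0.01),
    "dgender":     (0.0, 0.01, 0.0, 0.0),
    "infl":        (0.0, 0.0, 0.0, 0.0),
    "loginternet": (-0.02, -0.01, 0.0, 0.0),
    "dexpect":     (-0.05, -0.07, 0.0, 0.0),
    "emissions":   (0.01, 0.01, 0.01, 0.0),
    "dhealth":     (0.0, 0.0, 0.0, 0.0),
    "urban":       (0.01, 0.01, 0.01, 0.33),
    "lgini":       (-0.07, -0.11, 0.0, 0.0),
    "dmilitary":   (0.2, 0.0, 0.13, 0.08),
}
METHODS = ("lasso", "ridge", "adaptive lasso", "nng")

# Variables marked with at least one star in the last column of Table 2.
PANEL_SIGNIFICANT = {"dbirth", "drate", "mortality", "unempl",
                     "dshadow", "urban", "dmilitary"}


def selected(method):
    """Variables whose printed coefficient is non-zero for the given method."""
    col = METHODS.index(method)
    return {name for name, row in TABLE2.items() if row[col] != 0.0}


def compare(method):
    """Variables selected but not significant (extra) and the reverse (missed)."""
    chosen = selected(method)
    return {"extra": sorted(chosen - PANEL_SIGNIFICANT),
            "missed": sorted(PANEL_SIGNIFICANT - chosen)}


for method in METHODS:
    print(method, len(selected(method)), compare(method))
```

Read this way, the nonnegative garrote departs from the panel models only in the birth rate, whereas the adaptive lasso both drops the birth rate and retains emissions.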

As the fixed effects in panel data imply fixed X variables and the little bootstrap in nonnegative garrote uses
sampling with fixed variables, nonnegative garrote can be successfully applied as a panel feature selection method.
In this research, we validated the results from the panel models with panel GMM and a robust covariance matrix
(Table 3). After confirming the results, we ran nonnegative garrote. The results show that nonnegative
garrote with little bootstrap was another way to validate the results of our research. The variables selected
via nonnegative garrote coincided not only with those from the panel models, but also with the results of the robustness
checks. Nonnegative garrote proved to be useful for variable selection not only in smaller cross-sectional
datasets but also in panel data. As a result, nonnegative garrote with little bootstrap can be applied to fixed effects panel
models without multicollinearity and outliers. Nonnegative garrote with cross validation can be applied to random
effects panel models.

Table 3: Robustness checks

Variable        | Model 5   | Model 6   | Model 7   | Model 8
DBIRTH          | -0.06**   | -0.08***  | -0.06**   | -0.05*
                | (-1.93)   | (-2.57)   | (-1.92)   | (-1.60)
DRATE           | -0.11***  | -0.10***  | -0.10***  | -0.11***
                | (-2.35)   | (-2.53)   | (-2.35)   | (-2.41)
MORTALITY       | -0.05***  | -0.05***  | -0.05***  | -0.05***
                | (-15.93)  | (-14.30)  | (-15.33)  | (-15.78)
UNEMPL          | -0.02***  | -0.02***  | -0.02***  | -0.02***
                | (-7.08)   | (-7.56)   | (-6.57)   | (-7.07)
URBAN           | 0.01***   | 0.01***   | 0.01***   | 0.01***
                | (10.21)   | (9.95)    | (10.01)   | (10.32)
DMILITARY       | 0.27***   | 0.26***   | 0.27***   | 0.26***
                | (3.96)    | (3.83)    | (4.01)    | (3.86)
DGENDER         |           | 0.01      |           |
                |           | (0.81)    |           |
LGINI           |           |           | -0.09     |
                |           |           | (-1.19)   |
DSHADOW         |           |           |           | 0.08***
                |           |           |           | (2.39)
R2              | 0.67      | 0.63      | 0.67      | 0.68
Instrument rank | 22        | 9         | 23        | 23
J-stat          | 139.34*** | 105.40*** | 149.22*** | 137.48***

Source: authors’ calculations


A similar conclusion can be drawn from Table 4, where we have analyzed the determinants of the change in GDP
in the context of the EU scoreboard indicators for financial crisis [29]. Table 4 presents the results from the panel model
with period and cross-section fixed effects. Table 5 compares the panel model with the same feature
selection methods that we applied to the property rights dataset.

Table 4: Panel two-way fixed effects model for scoreboard of indicators for financial crisis in the EU
Dependent Variable: GDPCHANGE

Variable          | Coefficient | Std. Error | t-Statistic | Prob.  | Significance
c                 | 1.808557    | 0.736731   | 2.454842    | 0.0144 | ***
dcpi              | 0.055932    | 0.020464   | 2.733148    | 0.0065 | ***
dfincon           | -0.109693   | 0.040897   | -2.682185   | 0.0075 | ***
dgfcf             | 0.307502    | 0.053179   | 5.782384    | 0.0000 | ***
dggdebt           | -0.010726   | 0.010249   | -1.046570   | 0.2958 |
dnip              | -0.012993   | 0.005526   | -2.351405   | 0.0191 | ***
dpsdebt           | -0.001152   | 0.004332   | -0.265959   | 0.7904 |
drd               | 0.058476    | 0.193896   | 0.301584    | 0.7631 |
dreer             | 0.005063    | 0.016126   | 0.313974    | 0.7537 |
dresidential      | 0.137797    | 0.145171   | 0.949207    | 0.3429 |
lactivity         | 0.188133    | 0.111589   | 1.685951    | 0.0924 | *
ldcredit          | -0.056197   | 0.118423   | -0.474543   | 0.6353 |
privatecf         | 0.021183    | 0.012114   | 1.748612    | 0.0809 | *
riskpovst         | 0.033869    | 0.019763   | 1.713716    | 0.0871 | *
totalfsliab       | 0.027452    | 0.007956   | 3.450620    | 0.0006 | ***
ulc               | -0.019610   | 0.003790   | -5.174630   | 0.0000 | ***
youthunemp        | -0.031945   | 0.013864   | -2.304101   | 0.0216 | ***
R-squared         | 0.620469    |            |             |        |
F-statistic       | 14.04933    |            |             |        |
Prob(F-statistic) | 0.000000    |            |             |        |

Source: authors’ calculations


Table 5: Comparison feature selection / panel model for scoreboard of indicators

Variable     | ridge    | lasso    | adaptive | nng      | panel significance
dcpi         | 0.01671  | 0.02767  | 0.07399  | 0.12600  | ***
dfincon      | -0.07761 | -0.05544 | 0.00000  | -1.82711 | ***
dgfcf        | 0.23082  | 0.44817  | 0.58651  | 3.18178  | ***
dggdebt      | -0.01821 | 0.01809  | 0.00000  | 0.00000  |
dnip         | -0.00686 | -0.00558 | 0.00000  | 0.00368  | ***
dpsdebt      | 0.00131  | 0.00000  | 0.00000  | 0.00000  |
drd          | 0.01177  | 0.00000  | 0.00000  | 0.00000  |
dreer        | 0.01476  | 0.00000  | 0.00000  | 0.04267  |
dresidential | 0.33380  | 0.28351  | 0.00000  | 0.00000  |
lactivity    | 0.02823  | 0.00000  | 0.00000  | 0.00515  | *
ldcredit     | -0.11780 | -0.07883 | 0.00000  | 0.00000  |
privatecf    | 0.02555  | 0.03043  | 0.02798  | 1.06233  | *
riskpovst    | -0.01291 | 0.00000  | 0.00000  | 0.32034  | *
totalfsliab  | 0.01771  | 0.03002  | 0.03608  | 1.28388  | ***
ulc          | -0.00515 | -0.00988 | -0.01187 | 0.00000  | ***
youthunemp   | -0.01350 | -0.00493 | 0.00000  | 0.73674  | ***

Source: authors’ calculations


The column for panel significance in table 5 shows which variables are statistically significant according to the 
panel two-way fixed effects model. When we compare ridge with the panel model, it becomes clear that ridge failed 
to capture statistical significance, as all coefficients are different from zero. Ridge is not a reliable method for feature 
selection in panel datasets with fixed effects. 

Similarly to the property rights dataset, lasso shrunk some variables to zero. However, lasso identified variables
which are not significant in the panel model. For instance, the differenced government debt and the domestic credit
are not significant in the panel model, yet lasso retains them. Lasso returned the log of activity rate (lactivity) as insignificant
when, in fact, it is significant. The same holds for the risk of poverty (riskpovst). The lasso regression failed to
perform feature selection successfully.

Adaptive lasso, in contrast to lasso and ridge, resulted in feature selection similar to the panel model. Some
discrepancies can be observed, however. The adaptive lasso selected the differenced final consumption (dfincon), the
net international investment position (dnip), the log of activity rate (lactivity), the risk of poverty (riskpovst) and the youth
unemployment (youthunemp) as insignificant when, in fact, they are significant. The accuracy of the adaptive lasso in the dataset is
questionable.

When we look at the nng column, we see that the nonnegative garrote identified almost the
same significant variables as the two-way panel model. The only discrepancy is in the dreer variable. The nonnegative garrote
considers the real effective exchange rate to have a nonzero coefficient, while the panel model excludes it as
an insignificant variable. According to the nonnegative garrote, if the rate of increase in the real exchange rate becomes
faster, the GDP will increase by 0.04%, which may be a very small, almost unobservable increase. In contrast to the
other feature selection methods, the nonnegative garrote resulted in identifying all significant variables in the panel.

Similarly to the property rights dataset, the nonnegative garrote performed accurate feature selection in the panel 
scoreboard indicators for financial crisis. Despite the fact that different indicators and dependencies characterize the 
two datasets, the common line between them is the presence of fixed effects. The presence of fixed effects in panel 
models is modelled by OLS with fixed effects. As the little bootstrap captures the fixed effects, the nonnegative 
garrote successfully performs variable selection in panel data. 

E. Concluding remarks

Breiman [1] introduced the advantages of nonnegative garrote as a variable selection method. He proved its
computational advantage over subset selection methods and its ability to perform robust variable selection in small and
big datasets, unlike lasso and ridge. We broaden this set of advantages by showing that nonnegative garrote, unlike
other variable selection methods, accounts for fixed X variables by the little bootstrap procedure in fixed effects panel
data and for random panel effects by cross validation. Although we lack sufficient empirical evidence, we believe
our finding has an important impact on how panel data can be preprocessed.

ABSTRACT- Cloud vendor lock-in is one of
the major problems in cloud computing: the
customer is locked to a particular vendor, so that it
is difficult to migrate from one cloud to
another. The problem is that once an app has been
developed against a particular cloud service
provider’s API, that app is bound to that provider.
As a result, migration from one cloud to
another becomes more complex because of
differences in the architectures of different cloud service
vendors [2]. The problem can be solved by
providing a standardized way of interacting with
cloud service providers (CSPs), taking many factors into
consideration: isolating each individual
module involved in a cloud service provider’s
API, bringing out the common elements and
uniting them, so that in future any CSP
will have to obey those specific standards and build
its APIs without creating a new
standard that makes migration from/to that CSP
complex.

Key Words: - Vendor lock-in, API, Migration, 
CSP 

1. INTRODUCTION 

Cloud computing is one of the most rapidly
growing technologies in the world of the Internet
because it provides on-demand services to both
end users and developers in the form of
SaaS, PaaS, and IaaS. As the popularity of cloud
computing grows, its problems are growing at
much the same pace. One of the major
problems is vendor lock-in, where the customer is
locked in to a particular vendor: once he/she has
used a particular CSP for the first time, he/she
has to keep using it, since the costs of migration
are high in terms of both time and money. This
serious issue has been tackled in many different
ways. One proposal is to provide a unified API,
which solves vendor lock-in at the PaaS and IaaS
levels. Another proposal is the use of containers,
which are much like virtual machines but are more
lightweight. Containers solve the problem at the
IaaS level by bundling the entire environment
needed to run and deploy an application, leading to
a smoother migration to another CSP. However,
this approach still faces the problem of lock-in to
the container itself, and it does not solve the
problem at the PaaS level.

2. LITERATURE SURVEY [4] 

The standardized API is based on an already
existing system called the Meta cloud.
The standardized API derives some of its
architecture from the Meta cloud, but provides more
fine-grained control over the API and is
standardized for building most cloud service APIs.
Some of the terms are defined below:

Cloud: A cloud is a network of computers which
provides several services to the end user, where
the services are distributed to run and share data
over a network, i.e. the services can utilize the
computing power of the various computers connected
to the cloud on demand, so that no resources are
wasted. Most applications are now
moving to the cloud because of scalability.

IaaS: This stands for Infrastructure as a Service,
which means the cloud vendor provides only the
infrastructure, i.e. the hardware needed to build
and deploy applications. The customer needs to
install his own software on top of the
vendor's infrastructure.

PaaS: This stands for Platform as a Service,
which means that the vendor provides an
environment or a development platform to build
and deploy an application. For example, the
vendor may provide Debian Linux with Java
installed on it. So, if the customer wants to
develop a Java-based application, he/she need not
bother with installing the Java software
or the operating system on the vendor's computer.

SaaS: This stands for Software as a Service,
which means that the vendor provides software
for the end user to access. Typically, these are
applications, such as CRM software, that run on
the cloud. Some examples are Google Search,
Facebook, and Salesforce. These are the most
widely used cloud services.

Migration from one cloud provider to another in
the case of SaaS may seem complex, but generally it
depends completely upon the vendor and how
well they facilitate migration.

Different Migration Techniques

Live migration is a method that migrates the
whole OS from one physical machine to another. The
virtual machine is migrated live, without
disrupting the applications running on it.

The benefits of virtual machine migration
include conservation of physical servers, load
balancing across the physical servers, and fault
tolerance in case of sudden failure. The
various virtual machine migration techniques
are as follows.

Fault Tolerant Migration Techniques 

Fault tolerance permits the virtual machines to
continue their tasks even if any part of the system
fails.

This approach migrates the virtual machine from
one physical server to a different physical
server based upon the prediction that a failure
will occur. The fault tolerant migration technique
improves the availability of the physical server
and avoids performance degradation of applications.

Load Balancing Migration Techniques 

The load balancing migration approach aims to
distribute load across the physical servers to
improve the scalability of physical servers
in a cloud environment. Load balancing aids in
minimizing resource consumption, implementing
fail-over, improving scalability, and avoiding
bottlenecks and over-provisioning of resources.

Energy Efficient Migration 
Techniques 

The energy consumption of a data center is
mainly driven by the servers and their
cooling systems. Servers typically draw up to
seventy percent of their maximum power
consumption even at a low utilization level.
Therefore, there is a demand for migration
techniques that conserve the energy of servers
through optimal resource utilization.

LIVE VM MIGRATION IN CLOUD [3] 

Live migration is an especially powerful tool for
cluster and cloud administrators. An
administrator can migrate OS instances together
with their applications so that the physical
machine can be freed for other purposes.

There are two major approaches: post-copy
memory migration and pre-copy memory migration.
The post-copy approach first suspends the
migrating virtual machine at the source, then
copies the minimal processor state to the target
host, resumes the virtual machine there, and
begins fetching memory pages over the network from
the source node. There are two phases in the
pre-copy approach: the pre-copy phase and the
stop-and-copy phase. In the pre-copy (warm-up)
phase, the hypervisor copies all the memory pages
from source to destination while the VM is still
running on the source. If some memory pages change
during the copy process (dirty pages), they are
re-copied until the rate of re-copied pages is not
less than the page dirtying rate. In the
stop-and-copy phase, the VM is stopped at the
source, the remaining dirty pages are copied to
the destination, and the VM is resumed at the
destination.

Pre-Copy Phase [5]: At this stage, the VM
continues to run, while its memory is iteratively
copied page by page from the source to the target
host. Iteratively means that the algorithm
works in several rounds. It starts by
transferring all active memory pages. As every
round takes some time, and in the meantime the VM
is still running on the source host, some pages
may be dirtied and need to be re-sent in a further
round to ensure memory consistency.

Pre-Copy Termination Phase: Without any stop
condition, the iterative pre-copy phase could
continue indefinitely. Stop conditions depend
heavily on the design of the hypervisor used,
but usually take one of the following
thresholds into account: the number of performed
iterations exceeds a pre-defined threshold, or the
total amount of memory that has already been
transmitted exceeds a pre-defined threshold.


Stop-and-Copy Phase 

At this stage the hypervisor suspends the VM to
prevent further page dirtying and copies the
remaining dirty pages, as well as the state of
the CPU registers, to the destination host. Once
the migration process is completed, the hypervisor
on the target host resumes the VM.
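
The interplay of the pre-copy rounds, the stop conditions and the final stop-and-copy step can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `pre_copy_migrate`, the per-round page dirtying probability `dirty_rate` and both threshold parameters are hypothetical and do not correspond to any particular hypervisor.

```python
import random


def pre_copy_migrate(num_pages, dirty_rate, max_rounds=10,
                     max_total_sent=10**9, seed=0):
    """Simulate iterative pre-copy followed by stop-and-copy.

    dirty_rate: assumed probability that a page is written during one round.
    Returns (rounds, pages_sent_while_running, pages_sent_while_suspended).
    """
    rng = random.Random(seed)
    to_send = set(range(num_pages))  # the first round transfers every page
    rounds = 0
    sent_live = 0
    # Stop conditions from the text: iteration count or total data sent.
    while to_send and rounds < max_rounds and sent_live < max_total_sent:
        rounds += 1
        sent_live += len(to_send)
        # The VM keeps running, so some pages are dirtied during the round.
        to_send = {p for p in range(num_pages) if rng.random() < dirty_rate}
    # Stop-and-copy: the VM is suspended and the remaining dirty pages are sent.
    return rounds, sent_live, len(to_send)


if __name__ == "__main__":
    print(pre_copy_migrate(num_pages=1000, dirty_rate=0.05))
```

The third value of the result is the number of pages copied while the VM is suspended, so it is a rough proxy for downtime; a high dirtying rate keeps it large no matter how many rounds are performed.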

Post-Copy Approach: Post-copy VM migration
is initiated by suspending the VM at the source.
With the VM suspended, a minimal subset of the
execution state of the VM (CPU registers and non-
pageable memory) is transferred to the target. The
VM is then resumed at the target, although most of
the memory state of the VM still resides at the
source. At the target, when the VM tries to access
pages that have not yet been transferred, it
generates page faults.

These faults are trapped at the target
and redirected towards the source over the
network. Such faults are referred to as network
faults.

The source host responds to a network fault
by sending the faulted page. Since every page fault
of the running VM is redirected towards the
source, this technique can degrade the performance
of applications running within the VM. However,
pure demand paging combined with techniques like
pre-paging can reduce this impact to a great
extent.
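
The effect of demand paging and pre-paging on the number of network faults can be sketched as follows. The class `PostCopyTarget` and its methods are hypothetical names introduced only for this illustration; memory is modelled as a plain dictionary from page number to page content.

```python
class PostCopyTarget:
    """Toy model of the target host during post-copy migration."""

    def __init__(self, source_memory):
        self._source = source_memory  # pages that still reside at the source
        self._local = {}              # pages already present at the target
        self.network_faults = 0

    def read(self, page_no):
        if page_no not in self._local:
            # Page fault: the page is fetched from the source over the network.
            self.network_faults += 1
            self._local[page_no] = self._source[page_no]
        return self._local[page_no]

    def prepage(self, page_nos):
        # Pre-paging: push pages proactively so later reads do not fault.
        for n in page_nos:
            if n not in self._local:
                self._local[n] = self._source[n]


if __name__ == "__main__":
    target = PostCopyTarget({0: "a", 1: "b", 2: "c"})
    target.read(0)           # faults once
    target.prepage([1, 2])   # avoids two later faults
    target.read(1)
    print(target.network_faults)
```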

Meta Cloud Architecture [1]

Meta cloud: This API handles the key-value
pairs which store the information about various
cloud service providers and which are referred to
by the developer while coding. It is basically an
architecture which contains the Provisioning
strategy, Knowledge base, Resource monitoring,
Resource templates, Migration and deployment
recipes, Meta cloud API and Meta cloud Proxy.

Resource templates: These describe the cloud
services required for running an application. This
includes the services needed by the application,
how the services interact with each other, any
dependencies between services, how they are
wired together, etc.

Migration and deployment recipes: These
specify the tools or packages needed to migrate
from one cloud to another, and any
services required by the application.

Resource monitoring: This is the part of the API
which provides monitoring of the various cloud
services. The services can either go through the
standardized REST API, which makes use of the
HTTP/HTTPS protocol, or can come up with their
own protocol. Monitoring covers how much
bandwidth the application used, how many
instances are deployed, running and active, etc.



Figure 1. Conceptual meta cloud overview. Developers create cloud
applications using meta cloud development components. The meta cloud
runtime abstracts from provider specifics using proxy objects, and automates
application life-cycle management.


3. PROPOSED SOLUTION FOR 
VENDOR LOCK-IN 

Starting from the architecture specified in
Winds of change, we have eliminated the need for
the Meta cloud proxy and the Knowledge base, since
the focus of the paper is to provide a standardized
set of classes which can be used as the base for
building any cloud service provider API. The API
can be divided into two parts: the database part
and the basic services part. In the database part
we are trying to eliminate the lock-in of data to a
particular CSP, and in the basic services
part we are trying to eliminate the lock-in of the
application to a specific vendor.

3.1 STANDARDS FOR BASIC CLOUD 
SERVICES 

There is a vast number of cloud services,
like Database-as-a-Service, Java-as-a-Service, etc.
More or less, they can all be grouped into
databases and programming languages. Though there
are many more in the SaaS architecture, the focus
of the paper is to eliminate lock-in only in the
PaaS model. Most databases are classified as
either SQL databases or NoSQL databases, each
having their own model. Moreover, NoSQL
databases can be classified into key-value stores
and JSON models. When a PaaS service is chosen,
migrating from it involves changing the
connection code to that of the target CSP. This
is not much of a problem. The database
architectural changes involve changing the query
commands. For example, the SQL dialect of Oracle
differs from that of MySQL, in which case the
already existing ORM frameworks can come to the
rescue.

The one thing that definitely differs from one CSP
to another while migrating is the connection
part of the code. This is a minimal change and
is not a problem of concern. Therefore,
what needs to be standardized is the way of
interacting with each CSP, i.e. without changing
the classes and much of the code.

4. MODULES OF 
STANDARDIZED API 

4.1 Database module 

This module contains the set of classes which are
needed to interact with the database-as-a-service
model of the cloud service provider. A database is
very important for running an application, and at
times there may be a need to migrate to another
database, possibly at a different CSP.

In this case we utilize the existing
ORM frameworks to interact with the databases.
An ORM framework is one way of standardizing
several databases, since it converts the
relational model to objects. Dependencies between
table columns are specified in terms of
dependencies between different instance variables
of different classes. So any database migration
(for relational databases only), such as from
MySQL to Oracle or from Oracle to SQL Server, can
be facilitated easily by changing only the
database-specific connection code.

For NoSQL databases there are several
formats. One is the key-value store and the other
is the JSON object notation format. Programming
languages like Java already support the JSON
object notation, so this set of classes can be used
to convert Java objects to JSON format easily. The
only work needed is that the cloud service
providers of JSON-type NoSQL databases must
build their APIs on these classes.

For key-value pair databases like Redis, some
classes in Java like Map and other utility classes
can serve as a base. Again, the work needed
here is that the CSPs must conform to these
standard classes. Since the architecture is already
in place, the only work needed is to change the
connection code, which is discussed below.
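
The idea that a standard map class can serve as the common base for key-value stores can be sketched as follows. The text refers to Java's Map; this sketch uses Python's `collections.abc.MutableMapping` as the analogous standard interface. The class `InMemoryKeyValueStore` and the function `record_login` are hypothetical names used only for illustration, and the store keeps its data in memory instead of talking to a real CSP.

```python
from collections.abc import MutableMapping


class InMemoryKeyValueStore(MutableMapping):
    """Stand-in for a CSP key-value service that conforms to the map interface."""

    def __init__(self):
        self._data = {}

    def __getitem__(self, key):
        return self._data[key]

    def __setitem__(self, key, value):
        self._data[key] = value

    def __delitem__(self, key):
        del self._data[key]

    def __iter__(self):
        return iter(self._data)

    def __len__(self):
        return len(self._data)


def record_login(store: MutableMapping, user: str) -> int:
    """Application code: depends only on the standard mapping interface."""
    store[user] = store.get(user, 0) + 1
    return store[user]


if __name__ == "__main__":
    store = InMemoryKeyValueStore()
    record_login(store, "alice")
    print(record_login(store, "alice"))
```

Because `record_login` is written against the interface only, swapping the store for another conforming implementation does not require changes to the application code.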

4.2 Basic Services module 

The number of basic services is large.
Different CSPs provide different types of services
and features, and it is not possible to combine all
of them. However, the API gives the vendors the
flexibility to add additional features on top of
these standardized APIs. For the basic connection
code, standardized REST APIs are already available
and are supported by many CSPs and by
many programming languages.

Apart from this, the CSPs can also choose their
own protocols to interact with their services. In
this case, the configuration details are kept in
a separate file or database and are loaded from
there. The API classes load those details and
perform requests to those services as stated in the
configuration. So, the basic services module
eliminates the vendor lock-in problem of the
application code.
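
A minimal sketch of configuration-driven access is given below. It only builds a request description and performs no network activity. The configuration keys (`provider`, `base_url`, `protocol`), the helper names and the example URLs are assumptions made for this illustration; the paper does not define a concrete configuration format.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceConfig:
    provider: str
    base_url: str
    protocol: str = "rest"


def load_config(text: str) -> ServiceConfig:
    """Parse the provider details from a JSON configuration document."""
    data = json.loads(text)
    return ServiceConfig(
        provider=data["provider"],
        base_url=data["base_url"],
        protocol=data.get("protocol", "rest"),
    )


def build_request(config: ServiceConfig, resource: str, method: str = "GET") -> dict:
    """Application-side call: identical for every provider."""
    url = config.base_url.rstrip("/") + "/" + resource.lstrip("/")
    return {"method": method, "url": url, "protocol": config.protocol}


if __name__ == "__main__":
    # Hypothetical providers; migrating means swapping the configuration only.
    cfg_a = load_config('{"provider": "A", "base_url": "https://api.a.example/v1/"}')
    cfg_b = load_config('{"provider": "B", "base_url": "https://cloud.b.example"}')
    print(build_request(cfg_a, "/instances"))
    print(build_request(cfg_b, "/instances"))
```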


4.3 Extensions module 

The API also provides a way of adding
additional capabilities, so that the cloud service
providers can extend the API and provide their
own classes and interfaces to interact with
additional features of their cloud service. The
extensions module is provided as a package
of classes which allows the CSP to add
additional classes. Any form of extension can
be created as a package in the CSP's API and may
involve several classes in addition to the
standardized API. Since any extension is some form
of network activity (if it is an additional feature
of the cloud service itself, rather than of the
API), there are always protocols and classes
necessary to access it.

For API-specific additions, every extension class
must inherit/implement the classes/interfaces
(respectively) which mark those classes in the API
as extension classes. The extension module is
nothing but a set of extension classes which are
grouped in a package. Any extension class that the
CSP API would like to provide has to
implement the Extension interface.
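
The requirement that every extension class implements a common Extension interface can be sketched with an abstract base class. Only the name `Extension` comes from the text; its members (`name`, `invoke`), the `ExtensionRegistry` class and the example extension are assumptions introduced for illustration.

```python
from abc import ABC, abstractmethod


class Extension(ABC):
    """Marker interface that every CSP extension class must implement."""

    @property
    @abstractmethod
    def name(self) -> str:
        """Unique name under which the extension is registered."""

    @abstractmethod
    def invoke(self, **params):
        """Run the provider-specific feature."""


class ExtensionRegistry:
    """Holds the extension classes that a CSP ships in its package."""

    def __init__(self):
        self._extensions = {}

    def register(self, extension):
        if not isinstance(extension, Extension):
            raise TypeError("extension classes must implement Extension")
        self._extensions[extension.name] = extension

    def invoke(self, name, **params):
        return self._extensions[name].invoke(**params)


class ImageResizeExtension(Extension):
    """Hypothetical provider-specific feature."""

    @property
    def name(self):
        return "image-resize"

    def invoke(self, **params):
        return {"resized_to": (params["width"], params["height"])}


if __name__ == "__main__":
    registry = ExtensionRegistry()
    registry.register(ImageResizeExtension())
    print(registry.invoke("image-resize", width=640, height=480))
```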

5. EXPECTED RESULTS 

The standardized API would force the
vendors to stick to specific standards, but it does
not restrict them from adding additional features
to their cloud, because of the extension module.
This would result in an easier migration from one
cloud to another. Since the information
necessary for migrating to a particular cloud and
the recipes are already stated in
the architecture, the migration would only require
changing the configuration file which holds
the information needed to connect to a particular
cloud service provider. The standardized API
makes it simple to migrate to a different cloud,
just as JDBC made it simple to migrate from one
database to another, at the cost of changing only
the connection information.

6. CONCLUSION 

The proposed solution for eliminating vendor
lock-in must be implemented at the
programming language level. Just as the Java
programming language came up with the
JDBC API to interact with databases
from Java, an API to which
all the database providers conform, this
standardized API must also be implemented as a
part of the programming language so that the
CSPs conform to it and build their services
upon this set of classes. Rather than the
customer being locked in to the vendor, the vendor
is locked in to this standardized API. The API
covers most (if not all) of the cloud services by
classifying them into basic categories, and also
provides a way to add additional functionality for
the many other features that the cloud service
providers have to offer.

Abstract

With the growth of digital media, modification
and transfer of information has become very easy.
This work therefore focuses on transferring data by
hiding it in an image. A robust approach is
achieved by using the skew tent map as the
encryption/decryption algorithm at the sender and
receiver side. As the initial step, the image is
transformed into inverse S-order so that some
confusion is created for the intruder. The data
hiding itself is done using a modified histogram
shifting method. The approach is designed so that
both the hidden information and the image can be
recovered with no information loss. Experiments
are done on a real image dataset. The values of
the evaluation parameters show that the proposed
work maintains the SNR, PSNR, Throughput, Data
Hiding Execution Time and Extraction Time values
with high security of the information.

Keywords: Digital data hiding, Encryption,
Histogram, Image Processing.

1. Introduction 

As the web develops, users are increasingly
attracted by different service providers, such as
online shops, digital marketing, social networks
[21, 23] and so on. This easy access makes it easy
to change ownership, as users can steal others'
work and produce digital copies under their own
names. This technology thus gives rise to the
new issue of piracy [12, 18].

To overcome this issue, many approaches have been
suggested for protecting the ownership of digital
data. Among these approaches, digital data
embedding, also called digital data hiding, plays
an important role. In order to establish the
ownership of the data owner, digital data is
embedded into the image, video, or data, as in
[11, 19, 23].

One of the basic requirements of data hiding
is that the hidden data must be imperceptible. The
use of steganography has many advantages and is
very useful in digital image processing, which
makes it suitable for a wide range of
applications. In the modern era, digital
communication offers great convenience in
transmitting large amounts of data to various
parts of the world [1, 15]. However, the safety and
security of long-distance communication remain an
issue. The need to solve this problem of
security and safety has promoted the
development of steganography schemes. Steganography
is different from watermarking and
cryptography [2, 16]. The main objective of
steganography is to conceal the existence of the
message itself, which makes it difficult for an
observer to figure out where exactly the
message is. Cryptographic techniques, on the other
hand, tend to secure communications by changing
the data into a form that cannot be understood by
an eavesdropper [6, 22]. In watermarking, the logo
is more important than the data.
Steganography is the kind of hidden
communication that means "covered writing"
[5, 10] (from the Greek words steganos, "covered",
and graphein, "to write").

Data hiding is the process of concealing
data in a cover medium. The data hiding process
thus involves two kinds of data [9, 13]: the
embedded data and the cover media data. The data is
transmitted by embedding it inside images, which
enhances data security. A data hiding method in
which reversibility can be achieved is called
reversible data hiding [14, 17]. This method is
used to improve the security of the cover image
[21] in encryption. Reversible image [21] data
hiding (RIDH) [22] is one such method; it ensures
that the cover image is reconstructed perfectly
after removal of the embedded message. The
reversibility of this technique makes data hiding
attractive in critical scenarios, e.g.,
remote sensing, military, law, forensics,
medical image sharing [22, 24] and
copyright authentication, where the original cover
image is required after reconstruction [1].

2. Related Work 

In [18] digital data were embedded in a selected
portion of the image, where the edge region was
selected for embedding. The paper developed a
new approach to finding the pixels representing
edges. Using the Dam and BCV approach, the image
was segmented into edge and non-edge regions. One
drawback of this work is that it was done on binary
images only, so data hiding with this method is
limited to binary images. This leads to one
more limitation: the embedded data must also be in
binary form. Despite these issues, the image was
highly robust against different types of attacks
such as filtering [7, 16], noise, etc.

In [4] the author extended the work done in [18] by
increasing the overall capacity of the embedding
data space. In the Dam and BCV technique, the
author also considers the surroundings of the edge
region pixels, so the overall data hiding capacity
was drastically increased. Even after
embedding more data, the embedded image was robust
against different types of attack.

In [21] a self-embedding concept was proposed by
the authors, where the image itself generates the
data for embedding, while fountain codes were
developed for regenerating lost packets in order to
protect the data in the network. With fountain
codes, more packets than strictly required are sent
over the network, which helps in regenerating
missing or corrupt data packets. The work has the
major limitation that after embedding the image is
not available in its original format before
extraction. So the main purpose
of this work is only the transfer of data packets
from sender to receiver.


In [20] the same concept of self-generated data
hiding was used: the image was utilized
so that it generates its own data hiding
information. The paper focuses on the image, where
the spatial domain was used as the carrier
for inserting the digital data. At the
same time, similar information is required at the
receiver, which helps in recovering the digital
data. To cover this, both intra-code block and
inter-code block methods are used.

In [12] the authors utilize DWT features for
finding the pixel values for embedding. To
increase the randomness of the embedding, the
selection of pixels was not sequential; instead a
random Gaussian function was used for selecting
pixels at different positions. At the receiver
side, the hidden data [3, 8] was extracted from
the image with the help of some supporting
information. Both the hidden data and the image
were recovered at the receiving end.

In [25] the author adopts the KSVD technique for
embedding the digital data. The digital data was
encrypted using the RC4 algorithm. One dictionary
was maintained at the
receiver and the transmitter end for reducing the
size of the carrier signal. In this work, vacant
space created between the data was
utilized for the data embedding. This work gives
the freedom to extract the image, the digital data,
or both, in any order.

3. Proposed Methodology

The focus of this work is to hide digital
data in an image. The whole work is done in two
steps: embedding of the digital data and
extraction of the digital data. It is desired that,
while extracting the digital data, both the data
and the image are recovered [21, 25]. The block
diagram of the whole embedding work is shown in
Fig. 1.



Fig. 1 Block diagram of the proposed work.

3.1 Pre-Processing 

An image format is represented by a collection of
numbers in a fixed range. Reading the pixel
values of the image matrix is done in this step of
the proposed model.


The whole work focuses on images whose pixel
values are in the range 0-255. Reading an
image therefore means building a matrix with the
same dimensions as the image and filling each
matrix cell with the pixel value of the image at
the corresponding position.

3.2 Inverse S-Order 

In this step all the color channels, Red, Green and
Blue, are converted into a single-dimension matrix
or vector S. The pixel values are inserted into the
S vector in inverse S-order sequence. This can be
understood from the example below, where Fig. 2 (a)
represents the original image and Fig. 2 (b)
represents the inverse S-order of the matrix.

S = [4, 5, 6, 4, 6, 6, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9]

Fig. 2 Inverse S-order representation.


In this way all image pixel values are arranged
in the single vector S, where the order of the
pixel values is the inverse S-order. In the case of
a color image, the red matrix is inserted first,
then the green matrix, and finally the blue matrix.

3.3 Image Histogram 

In this step the S vector obtained after the
inverse S-order is used, and the histogram of the
image is computed with a bin width of one. For
example, let the color scale in Fig. 2 be 1 to 10;
then the number of occurrences of each pixel value
in the image is counted. For the above S vector,
Hi = [0, 0, 0, 4, 3, 5, 1, 1, 2, 0], where H
represents the pixel value count and i represents
the position in H, which corresponds to the color
value.
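
The counting step can be reproduced with a few lines of Python. This is a minimal sketch: the function name `histogram` and its `lo`/`hi` parameters are hypothetical, and the 1 to 10 color scale is the one assumed in the example above.

```python
from collections import Counter


def histogram(s, lo=1, hi=10):
    """Count the occurrences of every pixel value from lo to hi in s."""
    counts = Counter(s)
    return [counts.get(v, 0) for v in range(lo, hi + 1)]


if __name__ == "__main__":
    S = [4, 5, 6, 4, 6, 6, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9]
    H = histogram(S)
    # Entry i - 1 of H holds the count of pixel value i.
    for value, count in enumerate(H, start=1):
        print(value, count)
```

Since every pixel is counted exactly once, the entries of the histogram must sum to the length of S, which is a quick sanity check for the example.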


Fig. 3 Histogram of the original image.

3.4 Histogram Shifting and Data Hiding 

In order to make the data hiding reversible, this
work adopts a histogram shifting method for hiding
data in the image. From the above step the number
of occurrences of each pixel value is known; the
pixel value with the most occurrences, i.e. the
highest peak in the histogram, is P = {6}.
Similarly, the pixel values with zero
occurrences in the image are Z = {1, 2, 3, 10}.


Histogram shifting is performed by replacing the
peak value with a zero-occurrence pixel value, but
this has the limitation that the number of bits
that can be hidden is small. With P = {6},
pixel value 6 is present at 5 locations of the S
vector, so at most 5 bits of data can be hidden in
this carrier image. To increase the number of
usable positions in the image, the proposed work
includes other peaks of the histogram, which
increases the hiding capacity. If the peak
vector includes other pixel values, say
P = {6, 4, 5, 7}, then a total of 5 + 4 + 3 + 1 =
13 bits can be hidden in the image, where each peak
value is replaced by its corresponding value from
the zero vector Z = {1, 2, 3, 10}.
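
The peak value, the zero values and the hiding capacity of a chosen peak set can be computed directly from S. The sketch below assumes that each occurrence of a peak value carries exactly one hidden bit, as in the hiding rule of this work; the function names are hypothetical.

```python
from collections import Counter


def main_peak(s):
    """Pixel value with the largest number of occurrences in s."""
    return Counter(s).most_common(1)[0][0]


def zero_values(s, lo=1, hi=10):
    """Pixel values of the scale lo..hi that never occur in s."""
    present = set(s)
    return [v for v in range(lo, hi + 1) if v not in present]


def capacity(s, peaks):
    """Number of hidable bits: one bit per occurrence of each peak value."""
    counts = Counter(s)
    return sum(counts[p] for p in peaks)


if __name__ == "__main__":
    S = [4, 5, 6, 4, 6, 6, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9]
    print(main_peak(S), zero_values(S))
    print(capacity(S, [6]), capacity(S, [6, 4, 5, 7]))
```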

3.5 Data Hiding 

Here histogram shifting is performed to hide each
bit of the data. Shifting means replacing a peak
pixel value with its corresponding zero pixel
value. Let the hiding data be H = [1, 0, 0, 1]. As
per histogram shifting, if bit 1 occurs in the
hiding data then the peak value remains
unaffected, while if bit 0 occurs in the hiding
data then the peak value is replaced with the zero
value.

S= [4, 5, 6, 4, 6, 6, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9] 

1 0 0 1 

HS= [4, 5, 6, 4, 1, 1, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9] 

Fig. 4 Histogram of the original image.

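So the steps of hiding the data are
Input: S, H, P, Z
Output: HS // Vector S with the hidden data

1. pos = 1

2. c = 1 // c is the position of the peak value in use

3. p <- P[c]

4. z <- Z[c]

5. Loop i = 1 : length(H) // For each hiding bit

6. Loop while S[pos] != p // Find the next peak pixel, pos <= n, n number of pixel values in S

7. pos <- pos + 1

8. End Loop

9. If H[i] == 0 then

10. S[pos] <- z // Bit 0: replace the peak value with the zero value

11. End If // Bit 1: the peak value stays unchanged

12. pos <- pos + 1

13. End Loop

14. HS <- S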
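
The hiding steps can be written as a short runnable sketch. The `embed` function follows the rule stated in Section 3.5 for a single peak/zero pair. The `extract` function is an assumption about how the receiver inverts the process: it presumes that the receiver knows the peak/zero pair and the number of hidden bits, which the paper does not spell out.

```python
def embed(s, bits, p, z):
    """Hide bits in s: bit 1 keeps the peak value p, bit 0 replaces it by z."""
    hs = list(s)
    positions = [i for i, v in enumerate(hs) if v == p]
    if len(bits) > len(positions):
        raise ValueError("not enough peak pixels to hide all bits")
    for i, bit in zip(positions, bits):
        if bit == 0:
            hs[i] = z
    return hs


def extract(hs, n_bits, p, z):
    """Recover the hidden bits and restore the original vector."""
    s = list(hs)
    bits = []
    for i, v in enumerate(s):
        if len(bits) == n_bits:
            break
        if v == p:
            bits.append(1)
        elif v == z:       # z never occurs in the original image
            bits.append(0)
            s[i] = p       # undo the shift
    return bits, s


if __name__ == "__main__":
    S = [4, 5, 6, 4, 6, 6, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9]
    HS = embed(S, [1, 0, 0, 1], p=6, z=1)
    print(HS)
    print(extract(HS, 4, p=6, z=1))
```

Reversibility relies on z having zero occurrences in the original image: every z found in HS must therefore have been a peak pixel that carried a 0 bit.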

3.6 Skew Tent Map 


The HS vector obtained above, which contains the
secret data, is encrypted in this step. The
skew tent map algorithm takes a 128-bit input as
the key and the HS vector as the plaintext. Before
encryption, some basic steps of the skew tent map
are needed to obtain various constants.

Step 1. The 128 key bits are divided into four sub
keys termed K1, K2, K3, K4, where each sub key
contains 32 bits.

Step 2. To compute the initial condition X0 for
the first skew tent map, choose any two of the
sub keys, e.g. K1 and K2. Compute a
real number X01 using the XOR operation
between them:

$X_{01} = \mathrm{XOR}(K_1, K_2)$

Further compute another real number X02, where n
represents the number of keys.

[equation for X02 not recoverable from the source]

$X_0 = (X_{01} + X_{02}) \bmod 128$

With the initial condition X0, the first skew tent
map is iterated T times and Xi is updated to the
latest state; the second skew tent map is iterated
one time to obtain Yi, as given below:

$Y_i = T(X_i)$

where $T_i = T_0$ for $i = 1$ and $T_i = T_m$



Fig. 5 Block diagram of data extraction at
the receiver end.

Now the updated values of Xi and Yi are used to
encrypt and decrypt the i-th plaintext and
ciphertext block as given below:

$C_i = \mathrm{XOR}(P_i, C_{i-1}, X_i)$ (encryption)

$P_i = \mathrm{XOR}(C_i, C_{i-1}, Y_i)$ (decryption)
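
The chained XOR formulas can be checked with a short sketch. Several points here are assumptions and not taken from the paper: the skew tent map is used in its standard form on [0, 1] with a parameter p, each map state is scaled to one byte to form the keystream, the initial block C0 is 0, and the receiver regenerates the same keystream (Yi = Xi), which is the condition under which the decryption formula inverts the encryption formula. The derivation of the initial condition from the 128-bit key is not modelled.

```python
def skew_tent(x, p):
    """Standard skew tent map on [0, 1] with parameter 0 < p < 1."""
    return x / p if x < p else (1.0 - x) / (1.0 - p)


def keystream(x0, p, n):
    """Iterate the map n times and scale each state to one byte (assumed)."""
    out = []
    x = x0
    for _ in range(n):
        x = skew_tent(x, p)
        out.append(int(x * 256) % 256)
    return out


def encrypt(plain, ks, c0=0):
    """C_i = P_i XOR C_(i-1) XOR X_i."""
    out, prev = [], c0
    for p_i, x_i in zip(plain, ks):
        c_i = p_i ^ prev ^ x_i
        out.append(c_i)
        prev = c_i
    return out


def decrypt(cipher, ks, c0=0):
    """P_i = C_i XOR C_(i-1) XOR Y_i, with Y_i equal to X_i."""
    out, prev = [], c0
    for c_i, y_i in zip(cipher, ks):
        out.append(c_i ^ prev ^ y_i)
        prev = c_i
    return out


if __name__ == "__main__":
    HS = [4, 5, 6, 4, 1, 1, 6, 5, 4, 7, 8, 9, 5, 4, 6, 9]
    ks = keystream(x0=0.37, p=0.6, n=len(HS))  # illustrative values only
    C = encrypt(HS, ks)
    print(decrypt(C, ks) == HS)
```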


This part of the proposed work covers image
extraction at the receiver side. First, the skew
tent map decryption algorithm is used to obtain the
decrypted data. The resulting series is taken
as input to histogram shifting, to which the peak
and zero key pair is passed. After this the hidden
data is obtained.

In this way all the plaintexts in the form of bits
are combined to make the secret data, and the ASCII
values are converted into the corresponding
characters. At the end, the vector output by the
skew tent map decryption is arranged in matrix
form.

4. Experiment and Result Analysis 

This section presents the experimental assessment
of the proposed embedding and extraction technique
for image confidentiality. All algorithms and
evaluation procedures were implemented in MATLAB
on a machine with a 2.27 GHz Intel Core i3
processor and 4 GB of RAM running Windows 7
Professional.

4.1 Dataset 

The experiment was carried out on standard test images
such as Baboon, Lena, Boat and Peppers, which are
taken from
http://sipi.usc.edu/database/?volume=misc. The
system was tested on everyday images as well.


3.7 Extraction steps 

In these extraction steps the receiver can extract the
data and image by using the above block diagram.

3.8 Extraction of Image 
Fig. 7 Lena



Fig 9. Peppers 


4.2 Evaluation Parameter 


• Peak Signal to Noise Ratio 

It is ten times the base-10 logarithm of the ratio of the
maximum pixel value to the mean square error.

$$PSNR = 10 \log_{10} \left( \frac{Max\_pixel\_value}{Mean\_Square\_error} \right)$$

• Signal to Noise Ratio 

It is ten times the base-10 logarithm of the ratio of signal to noise.

$$SNR = 10 \log_{10} \left( \frac{Signal}{Noise} \right)$$


• Extraction Rate 

It is the percentage ratio of the number of true pixels
to the total number of pixels present in data hiding.

$$\eta = \frac{n_c}{n_a} \times 100$$

Here n_c is the number of pixels which are true and
n_a is the total number of pixels present in data
hiding.

• Throughput 

It is the ratio of the number of completions in bytes to the
execution time.

$$Throughput = \frac{No.\_of\_Completion\_in\_Bytes}{Execution\_Time}$$
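The evaluation parameters defined above can be sketched in Python as follows. This is a minimal illustration, not the authors' MATLAB code; the function names and the default maximum pixel value of 255 are assumptions. The PSNR follows the ratio exactly as stated in the text; a widely used variant squares the maximum pixel value, which is noted in a comment.

```python
import math


def mse(original, marked):
    """Mean square error between two equally long pixel sequences."""
    return sum((a - b) ** 2 for a, b in zip(original, marked)) / len(original)


def psnr(original, marked, max_pixel=255):
    """PSNR as written in the text: 10 * log10(max_pixel / MSE).

    Note: many references use max_pixel ** 2 in the numerator instead.
    """
    err = mse(original, marked)
    if err == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_pixel / err)


def snr(signal_power, noise_power):
    """SNR = 10 * log10(signal / noise)."""
    return 10 * math.log10(signal_power / noise_power)


def extraction_rate(n_c, n_a):
    """Percentage of correctly extracted pixels: n_c / n_a * 100."""
    return n_c / n_a * 100


def throughput(num_bytes, execution_time):
    """Bytes processed per unit of execution time."""
    return num_bytes / execution_time
```

For example, extraction_rate(41, 72) evaluates to roughly 56.94, which illustrates how a percentage of that form arises from a pixel count.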


4.3 Results Analysis 

The implementation was carried out in MATLAB
with various images. The results are
analyzed on the basis of the Signal to Noise Ratio
(SNR), Peak Signal to Noise Ratio (PSNR), data
hiding execution time, data hiding extraction time,
extraction rate and throughput. The performance
of the proposed algorithm is compared with existing
techniques on the basis of PSNR, data hiding
execution time and throughput.

Table 1. SNR of Proposed Work

Images     SNR (Proposed Work)
Baboon     29.0555
Lena       34.2709
Boat       27.7464
Peppers    33.3673


Table 1 shows the SNR obtained with the proposed
watermarking scheme in the absence of noise:
29.0555 for Baboon, 34.2709 for Lena, 27.7464 for
Boat and 33.3673 for Peppers. Figure 10
represents the SNR values for various images.



Fig. 10. SNR of Proposed Work 
Table 2. Data Hiding Extraction Time of
Proposed Work

File Size (KB)    Data Hiding Extraction Time (ms)
50                0.8974
250               2.6375
500               3.3398
1000              9.8292
5000              26.9362
10000             43.2534


Table 2 shows the data hiding extraction time:
0.8974 ms for a 50 KB file, 2.6375 ms for 250 KB,
3.3398 ms for 500 KB, 9.8292 ms for 1000 KB,
26.9362 ms for 5000 KB and 43.2534 ms for 10000
KB. Figure 11 represents the data hiding
extraction time for various file sizes.

Fig. 11. Data Hiding Extraction Time of
Proposed Work

Table 3. SNR of proposed Work under the
consideration of Noise and Filter Attack

Images     Noise Attack    Filter Attack
Baboon     17.5420         15.9373
Lena       16.2348         15.2318
Boat       17.2797         13.3664
Peppers    17.5202         15.5973

Table 3 shows the SNR obtained with the proposed
watermarking scheme under noise attack and filter
attack respectively: 17.5420 and 15.9373 for
Baboon, 16.2348 and 15.2318 for Lena, 17.2797 and
13.3664 for Boat, and 17.5202 and 15.5973 for
Peppers. Figure 12 represents the SNR values
for various images under the Noise and Filter
Attack.

Fig. 12. SNR of proposed Work under the
consideration of Noise and Filter Attack



Fig. 12. SNR of proposed Work under the 
consideration of Noise and Filter Attack 
Table 4. Extraction Rate of proposed Work
under the consideration of Noise and Filter
Attack

Images     Noise Attack    Filter Attack
Baboon     56.9444         49.7222
Lena       61.1111         51.3889
Boat       54.1667         46.1768
Peppers    58.4948         50.8546


Table 4 shows the extraction rate obtained with the
proposed watermarking scheme under noise attack
and filter attack respectively: 56.9444 and
49.7222 for Baboon, 61.1111 and 51.3889 for
Lena, 54.1667 and 46.1768 for Boat, and 58.4948
and 50.8546 for Peppers. Figure 13 represents
the extraction rate for various images under the
Noise and Filter Attack.

Fig. 13. Extraction Rate of proposed Work
under the consideration of Noise and Filter
Attack

Table 5. PSNR of proposed Work under the
consideration of Noise and Filter Attack

Images     Noise Attack    Filter Attack
Baboon     57.1464         55.1864
Lena       58.8634         54.1357
Boat       55.9453         54.8743
Peppers    53.8231         51.3633

Table 5 shows the PSNR obtained with the proposed
watermarking scheme under noise attack and filter
attack respectively: 57.1464 and 55.1864 for
Baboon, 58.8634 and 54.1357 for Lena, 55.9453
and 54.8743 for Boat, and 53.8231 and 51.3633
for Peppers. Figure 14 represents the PSNR
values for various images under the Noise and
Filter Attack.



Fig.14. PSNR of proposed Work under the 
consideration of Noise and Filter Attack 


Table 6. PSNR Based Comparison between
proposed and previous work (A-S
Algorithm) [6].

Images     A-S Algorithm [6]    Proposed Work
Baboon     49.2                 77.0555
Lena       57                   80.4425
Boat       48.3                 75.7464
Peppers    52.5                 79.3673

Table 6 shows the PSNR obtained with the proposed
watermarking scheme in the absence of noise:
77.0555 for Baboon, 80.4425 for Lena, 75.7464
for Boat and 79.3673 for Peppers. Table 6
indicates that, under ideal conditions, the
proposed work outperforms the previous work in
[6] in terms of PSNR. Because the skew tent and
histogram shifting algorithms regenerate images
in color format only, this parameter is high
compared with the previous values. Figure 15
represents a comparison of the PSNR values for
various images between the proposed work and
the A-S algorithm [6].

Fig. 15. PSNR Based Comparison between
proposed and previous work (A-S Algorithm)
[6]

Table 7. Data Hiding Execution Time
comparison between proposed and previous
works [6].

Data Hiding Execution Time Comparison (ms)

File Size (KB)   3DES [6]   IDEA [6]   CAST-128 [6]   A-S Algorithm [6]   Proposed Work
50               120        49         43             2                   0.65409
250              170        69         43             4                   1.6352
500              232        101        73             7                   4.3348
1000             412        190        96             12                  7.8272
5000             1190       621        417            34                  25.7161
10000            2307       1039       713            62                  41.2718


Table 7 shows the data hiding execution time of
the proposed work: 0.65409 ms for a 50 KB file,
1.6352 ms for 250 KB, 4.3348 ms for 500 KB,
7.8272 ms for 1000 KB, 25.7161 ms for 5000 KB
and 41.2718 ms for 10000 KB. Table 7 indicates
that, under ideal conditions, the proposed work
is better than the previous work [6]. Because
the proposed work regenerates the dictionary
from the same data, its execution time is lower
than that of the previous work. Figure 16
represents a comparison of the Data Hiding
Execution Time for various file sizes between
the proposed work and several previous
algorithms [6].

Fig. 16. Data Hiding Execution Time comparison
between the proposed and previous work [6].

Table 8. Throughput comparison between
proposed and previous works [6].

Throughput Comparison (MBPS)

File Size (KB)   3DES [6]   IDEA [6]   CAST-128 [6]   A-S Algorithm [6]   Proposed Work
50               0.4167     1.0204     1.1111         25                  76.4421
250              1.4706     3.6232     5.2083         62.5                102.8865
500              2.1552     4.9505     6.8493         71.4286             115.3456
1000             2.4272     5.2632     10.4167        83.3333             127.7596
5000             4.2017     8.0515     11.9904        147.0588            194.4307
10000            3.9888     9.4429     14.0252        161.2903            242.2962


Table 8 shows the throughput of the proposed work:
76.4421 for a 50 KB file, 102.8865 for 250 KB,
115.3456 for 500 KB, 127.7596 for 1000 KB,
194.4307 for 5000 KB and 242.2962 for 10000 KB.
Table 8 indicates that, under ideal conditions, the
proposed work is better than the previous
work in [6]. Because the proposed work regenerates
the dictionary from the same data, its execution time
is lower than that of the previous work. Figure 17
represents a comparison of the throughput for
various file sizes between the proposed work and
several previous algorithms [6].
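As a worked check, the throughput figures quoted above appear to follow from dividing the file size by the data hiding execution times quoted in the discussion of Table 7. The sketch below assumes that 1 KB per millisecond is read as 1 MB per second; the variable names are hypothetical. The 250 KB row is left out because the quoted time (1.6352 ms) and throughput (102.8865) do not agree under this relation.

```python
# (file size in KB, execution time in ms, quoted throughput in MBPS)
rows = [
    (50, 0.65409, 76.4421),
    (500, 4.3348, 115.3456),
    (1000, 7.8272, 127.7596),
    (5000, 25.7161, 194.4307),
    (10000, 41.2718, 242.2962),
]

for size_kb, time_ms, quoted in rows:
    # KB / ms equals MB / s when 1 KB = 1000 bytes and 1 MB = 10**6 bytes.
    computed = size_kb / time_ms
    print(f"{size_kb:>6} KB: computed {computed:.4f}, quoted {quoted}")
```

The same relation reproduces the A-S Algorithm column as well, for example 50 KB / 2 ms = 25 MBPS.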



Fig. 17. Throughput comparison between 
proposed and previous works [6]. 
Abstract — Compression of video files is currently in
high demand. Color video compression has become a significant
technology for reducing memory space and
transmission time. Video compression using the fractal technique is
based on the self-similarity concept, comparing the range block
and the domain block. However, its computational complexity is very
high. In this paper we present a hybrid video compression
technique to compress Audio/Video Interleaved files and
overcome the problem of computational complexity. We
implemented the Discrete Wavelet Transform and a hybrid fractal HV
partition technique using Particle Swarm Optimization (called
mapping of PSO) for compression of videos. The analysis
demonstrates that the hybrid technique gives a very good speed-up in
compressing video and achieves a higher Peak Signal to Noise Ratio.

Keywords- Video Compression, DWT, Fractal, HV partitioning,
MATLAB, MSE, PSO, PSNR, ET, CR.

I. Introduction 

At present, digital video plays a very important role in
information technology, including teleconferencing,
broadcasting, military applications, entertainment and many
more [1, 2]. People need to access video very quickly and
within a limited period of time through various digital devices
[3, 4]. To deal with this situation, compression of video files
[5, 6] is a necessity. The DWT transform [1, 3, 9] forms the
layers of frames in terms of groups of frames [2, 4]. The
processing of frames [1, 2, 10] in layers is very slow for
compression [6]. Because of this slow compression, the encoding of video
[5, 6, 7] is a major problem in DWT-based video compression
[7, 8]. For encoding and fast processing [3, 5], the transform
function [7] uses a partition process in horizontal and
vertical terms [6, 8] for the local processing of layer frames in
different groups of frames. PSO (particle swarm optimization)
[1, 3] is used to reduce the search space [5, 9] in terms of
layers of blocks for coding. PSO [8] reduces the layer
space and decreases the encoding time [6], and the reduced
encoding time boosts [5, 10] the performance of video
compression [5]. In this work, segment II presents the
background of the discrete wavelet transform [3, 5] and the fractal
HV partition technique [5, 7] with PSO [7, 8], segment III
presents the experimental results and segment IV presents the
conclusion.


II. BACKGROUND 

Discrete Wavelet Transform 

Wavelet transforms [1, 2, 10] are broadly used in computer
vision [6, 9] for image compression [4, 6, 8]. The
concept of wavelets is closely allied to multi-scale [2] and
multi-resolution [4] applications and has been used in
image fusion techniques [5, 7]. Applying the Discrete
Wavelet Transform as an image processing method generates
transformation values called wavelet coefficients [5, 7]. The
fundamental concept behind wavelets is to examine a signal
according to scale. In recent years, wavelets have gained a lot
of interest in the fields of signal processing [5, 7], numerical
analysis and mathematics [8]. In general, the wavelet
transform is an advanced method of signal and image
analysis [4, 5].
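To make the idea of wavelet coefficients concrete, the following sketch computes a one-level DWT of a short signal. The source does not state which wavelet is used, so the Haar wavelet is assumed here purely for illustration, and the function names haar_dwt and haar_idwt are hypothetical.

```python
import math


def haar_dwt(signal):
    """One-level Haar DWT of an even-length sequence.

    Returns (approximation, detail) coefficient lists.
    """
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the original samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out


row = [4, 6, 10, 12]
approx, detail = haar_dwt(row)
restored = haar_idwt(approx, detail)
```

The approximation coefficients carry the coarse-scale content of the signal and the detail coefficients carry the fine-scale differences, which is what makes small detail values attractive to discard or quantise in compression.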
Fractal HV Partitioning Technique and Mapping of
PSO

In an HV partition [2, 6, 8] a rectangular range block [8] can
be split either horizontally or vertically into two smaller
rectangles [4]. A decision about the split location must be made.
While [4] adopts a criterion based on edge location, we
propose to split a rectangle such that an
approximation by its DC component (the DC component of a block is
defined here as the block whose pixel values are all
equal [6] to the average intensity of the block) in each part
gives a minimal total square error [5].
We expect that fractal coding will produce relatively small
collage errors with this choice, since approximation by
the DC component [3] alone will already give small sums of
squared errors by design of the splitting scheme, and for the
approximation of the dynamic part [6] of the range blocks we have
more domains available if the range block variances
are low [2].
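The split criterion described above can be sketched in Python as follows. This is a brute-force illustration under assumptions: the block is a small list of pixel rows, every possible horizontal and vertical cut is tried, and the function names are hypothetical. It is not the authors' MATLAB implementation.

```python
def sse_about_mean(values):
    """Squared error of approximating values by their mean (DC component)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_hv_split(block):
    """Return (error, direction, index) of the cut with minimal total error.

    'horizontal' cuts the block between rows index-1 and index;
    'vertical' cuts it between columns index-1 and index.
    Returns None for a block that cannot be split.
    """
    rows, cols = len(block), len(block[0])
    best = None
    for r in range(1, rows):
        top = [v for row in block[:r] for v in row]
        bottom = [v for row in block[r:] for v in row]
        err = sse_about_mean(top) + sse_about_mean(bottom)
        if best is None or err < best[0]:
            best = (err, "horizontal", r)
    for c in range(1, cols):
        left = [v for row in block for v in row[:c]]
        right = [v for row in block for v in row[c:]]
        err = sse_about_mean(left) + sse_about_mean(right)
        if best is None or err < best[0]:
            best = (err, "vertical", c)
    return best


block = [[0, 0, 10, 10]] * 3
print(best_hv_split(block))
```

For the sample block, whose left half is dark and right half is bright, the cut with zero error is the vertical one between the second and third columns.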

The HV partition technique [3, 7] prepares the video data for
encoding by dividing it into domain and range blocks
along the horizontal and vertical columns [4]
of the video data [1].

In this proposed work, to reduce the searching time [6]
between range and domain blocks, we used the PSO technique with
the HV partition technique [7], which reduces the time spent searching
for block symmetry and increases the block symmetry.
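The role of PSO in the range-domain search can be illustrated with the following simplified sketch. It is not the authors' method: the search is reduced to one dimension (finding the offset in a domain signal that best matches a range block), the inertia and acceleration constants are common textbook values chosen as assumptions, and all names are hypothetical.

```python
import random


def block_error(domain, range_block, pos):
    """Squared error between the range block and the domain window at pos."""
    return sum((domain[pos + i] - range_block[i]) ** 2
               for i in range(len(range_block)))


def pso_search(domain, range_block, particles=8, iters=30, seed=0,
               w=0.7, c1=1.5, c2=1.5):
    """Standard PSO update over candidate window positions (1-D sketch)."""
    rnd = random.Random(seed)
    hi = len(domain) - len(range_block)

    def cost(x):
        return block_error(domain, range_block, int(round(x)))

    xs = [rnd.uniform(0, hi) for _ in range(particles)]
    vs = [0.0] * particles
    pbest = list(xs)
    pcost = [cost(x) for x in xs]
    g = min(range(particles), key=lambda i: pcost[i])
    gbest, gcost = pbest[g], pcost[g]

    for _ in range(iters):
        for i in range(particles):
            # Velocity: inertia + pull to personal best + pull to global best.
            vs[i] = (w * vs[i]
                     + c1 * rnd.random() * (pbest[i] - xs[i])
                     + c2 * rnd.random() * (gbest - xs[i]))
            xs[i] = min(max(xs[i] + vs[i], 0.0), float(hi))
            c = cost(xs[i])
            if c < pcost[i]:
                pbest[i], pcost[i] = xs[i], c
                if c < gcost:
                    gbest, gcost = xs[i], c
    return int(round(gbest)), gcost


domain = [0, 1, 2, 3, 9, 8, 7, 3, 2, 1, 0, 5]
position, error = pso_search(domain, [9, 8, 7])
```

The swarm evaluates only a subset of candidate positions instead of every window, which is the sense in which PSO can shorten the search; like any heuristic, it is not guaranteed to return the exhaustive-search optimum.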



III. Experimental Results 

In this paper the proposed DWT, fractal
transform and PSO algorithms [5] have been implemented in
MATLAB 8.0. For testing various audio/video interleaved
videos [7], we used a desktop with a 1.86 GHz Intel processor
and 2 GB of RAM [3] running Windows
2007. For the performance evaluation we used
standard parameters such as PSNR, MSE, CR and ET of the video
[5, 7]. The measured parameters give better results than
DWT-based video compression [8]. The test videos were
obtained from the CV vision library [6, 8]. The whole process is
described here.

Description of Dataset 


Figure 2. Shows the original and compressed video view of battle.avi
video using mapping of PSO method.

We also obtained the PSNR, Compression
Ratio, Mean Square Error and Encoding Time for all the tested
videos.

The following tables show the comparison of DWT and
Mapping of PSO for various AVI videos with respect to:

• CR- Compression Ratio [8]

• MSE- Mean Square Error [8]

• PSNR- Peak Signal to Noise Ratio [8]

• ET- Encoding Time [8]


Table 1. Shows description of dataset used for compression of varied videos

S.No.   Video Name          Format of video
1       Battle video        Avi
2       Duck video          Avi
3       Cartoonduck video   Avi
4       Lab video           Avi
5       Sumrf video         Avi
6       Gunner video        Avi
7       Airplane video      Avi


Table 2. Analysis for DWT and Mapping of PSO Method for
battle.avi video

                     DWT      Mapping of PSO
Compression Ratio    0.77     0.81
MSE                  11.31    11.09
PSNR                 23.14    25.14
Encoding Time        1.80     1.92
Table 3. Analysis for DWT and Mapping of PSO Method for
duck.avi video

                     DWT      Mapping of PSO
Compression Ratio    0.44     0.59
MSE                  11.55    10.21
PSNR                 22.34    24.94
Encoding Time        0.52     0.69


Table 4. Analysis for DWT and Mapping of PSO Method for
cartoonduck.avi video

                     DWT      Mapping of PSO
Compression Ratio    0.89     0.96
MSE                  18.74    19.12
PSNR                 18.48    21.02
Encoding Time        0.76     0.88



Figure3: Shows the comparative performance of compression ratio using 
DWT and mapping of PSO method for battle.avi, duck.avi and 
cartoonduck.avi video 



Figure4. Shows the comparative performance of MSE using DWT and 
mapping of PSO method for battle.avi, duck.avi and cartoonduck.avi video 




Figure5. Shows the comparative performance of PSNR using DWT and 
mapping of PSO method for battle.avi, duck.avi and cartoonduck.avi video 



Figure6. Shows the comparative performance of Encoding Time using DWT 
and mapping of PSO method for battle.avi, duck.avi and cartoonduck.avi 
video. 

IV. CONCLUSION 

It is essential to reduce the storage space and encoding time of
video. From our experiments and results we conclude that
the DWT transform function suffers from distortion of
layers, and for this reason the PSNR value decreases. Particle
swarm optimization provides a dual searching mode
and reduces the multi-scale H-V partition relation of blocks
and reference blocks. This reduced search space speeds up the
compression technique while maintaining the quality of the video.
The mapping of PSO also reduces the redundant frames of the
video, reduces the MSE and increases the
PSNR.
Abstract—There is a strong connection between system success and
early identification of users' needs. The Concept of Operation
document (ConOps) is an important component for understanding
any system clearly in its early stages and for guiding a better change
process from the current state to the desired state. There is no specific
method that could aid in the process of collecting and analyzing the
necessary information to create a ConOps document, and many analysts
face difficulties writing an efficient ConOps document. In this
paper we argue that producing ConOps using ArchiMate helps in
the process of creating ConOps clauses. We attempt to relate
ConOps to the ArchiMate modeling language and then show that the ConOps
document and the ArchiMate modeling language can represent each
other efficiently. We also discuss how modeling the ConOps clauses
using ArchiMate promotes better understandability and
communication of the users' needs.


Key words: ArchiMate, ConOps, Concept of operation, Enterprise 
Architecture. 


I. Introduction 

In a previous study, we examined the role of the
motivation extension of ArchiMate [1] (ArchiMate is the
modeling tool for Enterprise Architecture) and its relation to
the adaptiveness of the Enterprise Architecture (EA) [2]. We
believe that certain natural language documents could be a
factor in the improvement of adaptive EA. Based on the
previous examination, we found that the Concept of
Operation document (ConOps) is the ideal document to serve
as a basis for our research, for several reasons. The IEEE
standard 1362-1998 (IEEE Std 1362-1998) system definition
concept of operation (ConOps) [3] document is widely used for
its well-developed guidelines. It is used to guide the transition
of systems; it provides a clear description of the current and
desired system as well as a description of the transition and
change. In this paper, we start by presenting the research
background and research problem in section II, followed by the
research methodology and questions in section III.


Next, in order to support our claim that
ArchiMate could aid the process of creating ConOps and vice
versa, we first establish a connection between ConOps
and ArchiMate by mapping elements, then evaluate the
mapping and give a simple demonstration of creating an
ArchiMate model from a ConOps clause by analysis in section
IV. In section V we present an experiment to show how using
ArchiMate to create ConOps promotes accuracy and reduces
difficulty. In section VI, an experimental case study is
presented to confirm that using ArchiMate promotes efficient
communication and better understanding. Finally, the discussion is
in section VII and the conclusion and future research are in section
VIII.

II. Background and research problem 

There are many benefits to creating a ConOps document as the
basis for any system definition. Not only does ConOps facilitate
communication and consensus among stakeholders, it also plays
an important role in the whole development lifecycle, because
it is used to derive requirements and later to evaluate the
system [11]. After system implementation, the system is
validated and verified against the ConOps, which gives a
baseline for measuring efficiency [12].

A. ConOps and the Enterprise Architecture 

ConOps IEEE 1362-1998 is the only concept of operation
standard that describes the current state of the system, the justifications for
change and the concept for the proposed system, which
contributes to guiding the transition from the current state to the future
state [3]. This format could serve in tracing and applying
changes in a consistent manner. When applying the
Architecture Development Methodology (ADM), based on
TOGAF 9.1 [7], there are three levels of partitioning for
enterprise architecture in organizations: strategic level
architecture (long-term main objective and goal for the
organization), segment level architecture (parts of the
organization such as departments) and capability level
architecture. ConOps can be related to each level as shown in
Figure 1. Enterprise architecture can be applied to any of the
segments (the segment by itself can be identified as an
enterprise). As the organization matures, each segment would
create synergy with the strategic level; this should provide a
well-built enterprise architecture that enables flexibility for
changes on every level without failure, which leads to creating
value and balance to resist failure in the long term.

Figure 1. Clarification of Architecture landscape in relation to ConOps
document

B. ConOps and ArchiMate Motivation Extension 

In this research we utilize ArchiMate to support analysis 
and problem definition. ArchiMate is an open and independent 
enterprise architecture modeling language to support the 
description, analysis and visualization of architecture within 
and across business domains in an unambiguous way. 
ArchiMate 3.0 consists of a core language, which focuses on the
description of the four architecture domains defined by the
TOGAF standard (business architecture, data architecture,
application architecture and technology architecture), and two
extensions: Motivation and Migration [6]. The main
objective of ConOps is to make sure operational needs are clearly
understood by various users. Our hypothesis is that we can
improve understandability of the various concepts of the 
document and help identify operational needs by using 
ArchiMate concepts and viewpoint. However, detecting 
motivations can be complex; capturing such concerns in the 
form of drivers from stakeholders or from the environment can 
be confusing. That is why we also suggest leveraging ConOps 
document to get motivation extension aspects and support the 
architecture transition from present to desired state. 

III. Research Methodology

Based on the previous background study we decided to
examine the shortcomings of the ConOps document. The ConOps
document is still not used to its fullest potential; in many
systems, it is made after the system has matured or been
delivered [4] [13]. There are many reasons ConOps is
underutilized; one of them is the lack of a method for
analyzing and acquiring the content of the document. The IEEE
standard provides guidelines on content and format but no exact
technique [5]. “Each organization that uses this guide should
develop a set of practices and procedures to provide detailed
guidance for preparing and updating ConOps
documents” [3, p. iii]. This research is conducted using traditional
analysis including background study, comparative analysis and data
analysis.

Our hypothesis is that ArchiMate could aid in the process 
of creating ConOps document. The questions guiding the 
research are formulated as: 


Q1: To what extent can we model ConOps using
ArchiMate? (Mapping and evaluation, section 4.1)

Q2: How to produce ArchiMate model from ConOps 
Clause? (Method and demonstration, section 4.2) 

Q3: Does deriving ArchiMate model from ConOps promote 
understandability and facilitate communication? (Experiment 
and quantitative analysis, section 5) 

Q4: How efficient is it to create ConOps using ArchiMate? 
(Case study, section 6) 


IV. A Proposal on Using ArchiMate to Create ConOps

A. Mapping ArchiMate Concepts to ConOps clauses

In order to map ConOps clauses to the concepts of
ArchiMate, we classify the ArchiMate concepts
from the “Motivation Extension”, “Business Layer” and
“Application Layer” as follows [8]; the Motivation mapping is shown in Table 1:

• Specialization Problem Element -Meta classes:
Stakeholder (S), Driver (D) and assessment (A)

• Specialization Intention -Meta classes (Actual
motivation elements): Goal (G), Principle (P),
requirement (R), Constraints (Con), outcome (O)
and Value (V)

• Strategy Elements: Capability (C), Course of action
(CA) and resource (R)


Table 1. Mapping Motivation ArchiMate elements to ConOps clauses

ConOps element           ConOps clause                              S  D  A | G  P  R  Con  O  V | C  CA  R
                         1.3 System Overview                        -  -  - | ✓  ✓  -  -    -  - | -  -   -
Current System           3.1 Background, Objective and scope        -  ✓  ✓ | ✓  ✓  ✓  -    -  - | -  -   -
                         3.2 Operational policies and constraints   -  -  - | -  -  ✓  ✓    -  - | -  -   -
Justification of change  4.1 Justification of changes               -  ✓  - | ✓  -  ✓  -    -  - | -  -   -
                         4.2 Description of desired changes         ✓  -  - | -  -  -  -    ✓  ✓ | ✓  ✓   ✓
Proposed System          5.1 Background, Objective and scope        -  ✓  ✓ | ✓  ✓  ✓  -    -  - | -  -   -
                         5.2 Operational policies and constraints   -  -  - | -  -  -  ✓    -  - | -  -   -

• “Business Layer” & “Application Layer” 
elements including all elements in: Behavior 
entities, Passive entities and Active entities. The 
mapping is shown in Table 2. 


Table 2. Mapping the rest of ArchiMate layers to ConOps clause

ConOps Element    ConOps Clause                                         ArchiMate Layer
Current System    3.3 Description of the current system or situation    Business Layer, Application layer
                  3.5 User classes and other involved personnel         Business Layer
Proposed System   5.3 Description of the proposed system                Business Layer, Application layer
                  5.5 User classes and other involved personnel         Business Layer
                  6.0 Operational Scenario                              Business Layer
Our efforts to establish a connection between ArchiMate and
ConOps started by creating ArchiMate models from a number
of ConOps documents. Based on ontology definitions and the
created ArchiMate models, we were able to represent the coverage
of ConOps by ArchiMate as shown in Table 3.


Table 3. ArchiMate elements coverage of ConOps clauses

ConOps Main parts                    Sub-sections                                          Related ArchiMate element
Scope                                1.1 Identification                                    Out of Scope
                                     1.2 Document Overview                                 Out of Scope
                                     1.3 System Overview                                   Motivation Extension
Current System                       3.1 Background, Objective and scope                   Motivation Extension
                                     3.2 Operational policies and constraints              Motivation Extension
                                     3.3 Description of the current system or situation    Business Layer, Application layer
                                     3.4 Modes of operation                                Out of Scope
                                     3.5 User classes and other involved personnel         Business Layer
                                     3.6 Support environment                               Out of Scope
Justification and nature of change   4.1 Justification of changes                          Motivation Extension
                                     4.2 Description of desired changes                    Strategy Elements
                                     4.3 Priorities among changes                          Out of Scope
                                     4.4 Changes considered but not included
Concept for proposed system          5.1 Background, Objective and scope                   Motivation Extension
                                     5.2 Operational policies and constraints              Motivation Extension
                                     5.3 Description of the proposed system                Business Layer
                                     5.4 Modes of operation                                Out of Scope
                                     5.5 User classes and other involved personnel         Business Layer
                                     5.6 Support environment                               Out of Scope
Operational scenario                 6.0 Operational Scenario                              Business Layer
Summary of impact                    7.0 Summary of impact                                 Out of Scope
Analysis of the proposed system      8.0 Analysis of the Proposed System                   Out of Scope


B. Method and demonstration (Producing ArchiMate model
from a ConOps clause)

To demonstrate modeling ConOps using ArchiMate, we
used the "United States Government Printing Office (CONOPS
V2.0)" [10] as an example. Below is the ArchiMate model for a
portion of the ConOps clause “1.3 System Overview” (Figure
2).

Figure 2. Modeling ArchiMate from ConOps, a) portion from ConOps
document (top) and b) the derived ArchiMate model (bottom)

We suggest that the process begins with identifying lead
words in each section of the ConOps. "Lead words" are words
related to ArchiMate elements based on ontology. Examples of
lead words identified include: “future system”, “will be”,
“Proposed system”, “Believe that”, “Should be”, “primarily”,
“Services”, “However”, “as a result”, “issue”, “failed”, “must”,
etc. After identifying lead words we can match the sentence to
the appropriate ArchiMate element based on ontological
analysis. In Figure 2.a we show how sentences are divided
based on lead words, and in Figure 2.b the sentences are assigned to
the appropriate ArchiMate element. Deriving the ArchiMate
model is based on the analysis shown in Table 4 [8][15]. The
recognition of lead words in ConOps could assist in automating
the process in the future by creating domain ontologies from
textual documents [14].
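The lead-word step described above could be prototyped along the following lines. This is a hypothetical sketch, not the authors' procedure: only the pairing of "believe that" and "should be" with Principle is taken from Table 4, while the remaining pairings in LEAD_WORDS are illustrative assumptions, as are the names LEAD_WORDS and tag_sentence.

```python
import re

# Hypothetical lead-word table. Only "believe that" / "should be" -> Principle
# comes from Table 4; the other pairings are assumptions for illustration.
LEAD_WORDS = {
    "believe that": "Principle",
    "should be": "Principle",
    "must": "Requirement",   # assumption
    "issue": "Assessment",   # assumption
    "failed": "Assessment",  # assumption
}


def tag_sentence(sentence):
    """Return (lead word, candidate ArchiMate element) pairs found in a sentence."""
    text = sentence.lower()
    hits = []
    for lead, element in LEAD_WORDS.items():
        # Whole-word match so that e.g. "must" does not match "mustard".
        if re.search(r"\b" + re.escape(lead) + r"\b", text):
            hits.append((lead, element))
    return hits


print(tag_sentence("The future system should be open to all agencies."))
```

A matcher like this only proposes candidate elements; the ontological analysis described in the text is still needed to confirm each assignment.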

Table 4. Analysis method used to create the corresponding ArchiMate model

ConOps element: 1.3 System overview

ArchiMate element: Goal (Main)
Analysis: A goal of an undetermined agent. A goal is the propositional
content of an agent's intention. From the definition, we can state:
i. A stakeholder is committed to achieving a goal.
ii. Achieving the goal means bringing about certain effects in
reality.

Lead words: “Believe that ... Should be”
ArchiMate element: Principle
Analysis: The propositional content P of the desire is the result of the
application of the predicate Q on all systems in a given context,
i.e.
P ≡ ∀s (System(s) ∧ ContextPrinciple(s)) → Q(s), where System
holds for all systems,
ContextPrinciple holds for all in the context of
application of the principle and Q holds for the systems that
exhibit the desired properties stated in the principle.

V. USING ARCHIMATE TO REPRESENT CONOPS CLAUSES 
PROMOTES UNDERSTANDABILITY (EXPERIMENT) 

In this research, we encourage using ArchiMate to represent 
ConOps clauses. ArchiMate promotes better understandability 
and can therefore ease communication and support consensus 
among stakeholders. Based on the findings from the previous 
section, we can safely assume that using ArchiMate to model 
ConOps is valid. However, we still need to show that ArchiMate 
models of ConOps benefit communication and are easier for 
stakeholders to understand than the ConOps document itself. 
Our second assumption is therefore that ArchiMate models 
produced from ConOps promote understandability. To test this 
claim we conducted an experiment that measured the level of 
accuracy and difficulty. 

Measuring understandability would determine whether using 
ArchiMate to represent ConOps is of significant value. For this 
experiment we had a group of 10 postgraduate students from the 
Yamamoto lab of Nagoya University's department of 
information science, who were randomly divided into two 
groups. The participants were familiar with ArchiMate and 
had participated in a semester-long ArchiMate session at 
the university. As for their knowledge of ConOps, we 
introduced the document elements from IEEE Std 1362-1998 
before the experiment. For the experiment, we produced a 
number of ArchiMate models from ConOps (natural language 
based) documents, then created 2 sets of different questions to 
test accuracy and difficulty level. The following activities were 
presented to both groups: 

• Activity 1: Problem 1 (P1) for Group 1 (G1): 
Examined accuracy and understandability using a 
set of questions based on the original ConOps 
(Natural language). Accuracy level: 53%, 
difficulty level: 2.5 

• Activity 2: Problem 1 (P1) for Group 2 (G2): 
Examined accuracy and understandability using a 
set of questions based on derived ArchiMate from 
ConOps (Modeling language). Accuracy level: 
93%, difficulty level: 1 

• Activity 3: Problem 2 (P2) for Group 1 (G1): 
Examined accuracy and understandability using a 
set of questions based on derived ArchiMate from 
ConOps (Modeling language). Accuracy level: 
56%, difficulty level: 2 

• Activity 4: Problem 2 (P2) for Group 2 (G2): 
Examined accuracy and understandability using a 
set of questions based on the original ConOps 
(Natural language). Accuracy level: 24%, 
difficulty level: 3.6 
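
To make the comparison concrete, the four reported results can be averaged per representation. The sketch below uses only the accuracy and difficulty figures listed above; the variable and function names are illustrative, and averaging across the two problems is our simplification, not an analysis reported by the authors.

```python
from statistics import mean

# (problem, group, representation, accuracy %, difficulty) as reported above.
# NL = natural-language ConOps, ML = ArchiMate modeling language.
RESULTS = [
    ("P1", "G1", "NL", 53, 2.5),
    ("P1", "G2", "ML", 93, 1.0),
    ("P2", "G1", "ML", 56, 2.0),
    ("P2", "G2", "NL", 24, 3.6),
]


def summarize(results):
    """Average accuracy and difficulty for each representation."""
    summary = {}
    for representation in ("NL", "ML"):
        rows = [r for r in results if r[2] == representation]
        summary[representation] = {
            "accuracy": mean(r[3] for r in rows),
            "difficulty": mean(r[4] for r in rows),
        }
    return summary


summary = summarize(RESULTS)
for representation, stats in summary.items():
    print(f"{representation}: accuracy {stats['accuracy']:.1f}%, "
          f"difficulty {stats['difficulty']:.2f}")
```

Because each group answered one problem in each representation, the per-representation averages mix both groups and both problems, which is what the crossover layout of the activities is meant to allow.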


Figure 3.a shows significant improvement in the level of 
accuracy when the ArchiMate modeling language (ML) is used to 
represent ConOps as opposed to natural language (NL). Figure 
3.b also shows that the difficulty level decreases when 
participants answer questions from ArchiMate models based on 
ConOps. Figure 4 shows how accuracy increases as the level of 
difficulty decreases. 
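
The relationship between difficulty and accuracy can also be checked numerically. The following sketch computes a Pearson correlation over the four reported (difficulty, accuracy) pairs; it is an illustration added here, not a statistic reported in the paper, and with only four data points it should be read as descriptive.

```python
from math import sqrt

# Reported difficulty levels and accuracy percentages for the four activities.
difficulty = [2.5, 1.0, 2.0, 3.6]
accuracy = [53, 93, 56, 24]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(difficulty, accuracy)
print(f"Pearson r = {r:.3f}")
```

The coefficient comes out strongly negative, which agrees with the trend described for Figure 4: activities rated as less difficult were answered more accurately.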



Figure 3. The effects of natural language and modeling language on 
"difficulty" and "accuracy": a) "accuracy" results (top), b) 
"difficulty" level results (bottom). 

Figure 4. Relationship between accuracy and difficulty in the experiment 

VI. A CASE STUDY ON USING ARCHIMATE TO CREATE CONOPS 
CLAUSES FOR "MAN-HOUR MANAGEMENT SYSTEM" 

To measure the efficacy of creating ConOps using 
ArchiMate, we used ArchiMate as the main tool of 
communication for creating a "Man-hour management" system. 
Since there is no existing system, we skipped the "current 
system" clause and included the "justification for change" and 
"description of the proposed system" clauses for this case. 

In this experiment we attempt to measure the effectiveness of 
producing ConOps using the ArchiMate modeling language. We 
measured the time of execution for the two main work 
activities: creating the "justification for change" ConOps clause 
(W1) and creating the "description of the proposed system" 
ConOps clause (W2). 

The time for execution included: time for communication, 
time for creating ArchiMate model and time for creating 
ConOps from ArchiMate. 

The results were compared to the time it took to create the 
ConOps for the new system without using ArchiMate. 

The process of creating ConOps from ArchiMate was far 
more efficient, with better problem definition and analysis, 
easier communication, and less ambiguous definitions of 
concepts and relations. 


A. Case background 

The company had problems with the productivity of software 
development. There was a need for improvement; however, it 
was difficult to find the problem. After discussion, it was 
determined that the problem affecting productivity was the lack 
of a man-hour management system. There was also a concern 
from the Accounting department related to improving its 
payroll management procedures. The procedure involved 
manual labor that required a lot of time and effort because it 
relied on paper forms for payment. 
(Note: the ArchiMate model was created by the engineer.) 

In order to create the corresponding ConOps clauses from 
the Motivation and Business models of ArchiMate, we needed 
extra communication after creating the models. It took us 20 
minutes to create the Motivation model shown in Figure 5.a. 
Another 16 minutes were spent on exchanging the extra 
information shown in Figure 5.b. 

From the ArchiMate model in Figure 5.a, we derived the 
ConOps clause "Justification for and nature of change" from 
the motivational model of ArchiMate and from the extra 
communication, as shown in Figure 6. The process of deriving 
ConOps from the ArchiMate model took 10 minutes. 

We then created the Business model in 40 minutes (Figure 
7.a); extra communication took 8 minutes (Figure 7.b). 

Figure 8 shows the derived ConOps clause "Concepts of the 
proposed system". 
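
The reported durations can be added up per work activity. The sketch below is simple arithmetic over the figures given in the text; the dictionary layout is illustrative, and the 15 minutes for deriving the second clause is an assumption read from the heading of the clause shown in Figure 8, which is garbled in the source.

```python
# Minutes reported in the case study for each step of the two work activities.
TIMES = {
    "W1 (justification for change)": {
        "create Motivation model": 20,
        "extra communication": 16,
        "derive ConOps clause": 10,
    },
    "W2 (description of the proposed system)": {
        "create Business model": 40,
        "extra communication": 8,
        # Assumption: taken from the (garbled) time note in Figure 8.
        "derive ConOps clause": 15,
    },
}

totals = {activity: sum(steps.values()) for activity, steps in TIMES.items()}
grand_total = sum(totals.values())

for activity, minutes in totals.items():
    print(f"{activity}: {minutes} minutes")
print(f"Total: {grand_total} minutes")
```

Under these figures W1 takes 46 minutes and W2 takes 63 minutes. The text does not give the time for the baseline without ArchiMate, so no ratio is computed here.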



Figure 5. a) Motivation model using ArchiMate (top), b) extra 
communication (bottom) 


4.0 Justification for and nature of changes (Time: 10 minutes) 

The project manager is concerned about productivity; the current system mainly suffers from low productivity. To 
address the low productivity issue, visualization of the current productivity and finding improvement points are 
needed. To achieve the goal of visualizing current productivity, there is a need to have a record of man-hours 
for each task. To find improvement points, there is also a need to make a work breakdown structure (WBS) to clearly 
define tasks. There is also a concern raised by the Accounting department. The department wishes to improve 
payroll management; the analysis shows that the man-hour system is paper based. To improve the payroll 
management system, based on the previous analysis, a change to computerized form is needed to achieve a fully 
computerized (automated) system. 


Figure 6. The derived ConOps from the ArchiMate model 



e) What are the proposed system's major 
components? 

The major component is the project 
management tool (it was not clearly 
described in this ArchiMate model). It 
can provide functions, which were 
Planning, Analyzing, Recording, and 
Browsing, and the recorded information 
can be used in payroll management. 

f) What are the main capabilities and 
functions of the proposed system? 

The main capabilities are increasing 
productivity and improving payroll 
management, such as reducing operating 
time. The functions are Planning, 
Analyzing, Recording, and Browsing. 


Figure 7. a) Business model using ArchiMate (top), b) extra communication 
(bottom) 


Produced ConOps clause: 

5.0 Concepts of the proposed system (Time: 15 minutes) 

The proposed new system "project management" will provide the following functions: planning, 
analyzing, recording and browsing. "Recording" function: This function fulfills the requirements 
"creating the man-hour record" and "Record man-hour". The function serves the role of "recording 
status" for the "team members" stakeholder. The recording function will use information from the 
object "Task" and send information to the "actual man-hour" object. The "Task" object as a whole is 
part of the "WBS" object, and "actual man-hour" is partially part of the object "Task". The 
"Estimating Man-hour" object as a whole is part of the object "Task". "Analyzing" function: This 
function fulfills the requirement "make WBS clearly define tasks". It serves the roles of 
"Improvement" and "managing" for the stakeholder "project management". The function receives 
information from the "actual man-hour" and "task" objects. "Planning" function: This function 
fulfills the requirement "make WBS clearly define tasks". It serves the role of "planning" for the 
stakeholder "project management". The function sends results to the "WBS" object. "Browsing" 
function: This function serves the role "Pay salary" for the accounting department. It receives 
data from the object "Actual man-hour". 

Figure 8. The derived ConOps from The ArchiMate model 
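
The clause in Figure 8 follows a regular pattern: each function is described by the requirement it fulfills, the role it serves, and the objects it reads from or writes to. As a rough sketch of how such text could be generated from a model, the code below renders one function into ConOps-style sentences. The `BusinessFunction` class, its field names, and `render_clause` are hypothetical and are not part of ArchiMate or of the authors' procedure.

```python
from dataclasses import dataclass, field


@dataclass
class BusinessFunction:
    """Hypothetical, simplified view of one function in the Business model."""
    name: str
    role: str
    stakeholder: str
    requirements: list = field(default_factory=list)
    reads_from: list = field(default_factory=list)
    writes_to: list = field(default_factory=list)


def quote_all(items):
    """Join items as quoted names: "a" and "b"."""
    return " and ".join(f'"{item}"' for item in items)


def render_clause(fn):
    """Render one function as ConOps-style sentences."""
    parts = [f'"{fn.name}" function:']
    if fn.requirements:
        parts.append(
            f"This function fulfills the requirement {quote_all(fn.requirements)}.")
    parts.append(
        f'It serves the role of "{fn.role}" for the stakeholder "{fn.stakeholder}".')
    if fn.reads_from:
        parts.append(f"It receives information from {quote_all(fn.reads_from)}.")
    if fn.writes_to:
        parts.append(f"It sends information to {quote_all(fn.writes_to)}.")
    return " ".join(parts)


# Values taken from the "Planning" function described in Figure 8.
planning = BusinessFunction(
    name="Planning",
    role="planning",
    stakeholder="project management",
    requirements=["make WBS clearly define tasks"],
    writes_to=["WBS"],
)
print(render_clause(planning))
```

A template like this covers only the mechanical part of the derivation; the extra communication recorded in Figures 5.b and 7.b would still be needed to fill in details the model does not capture.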

VII. Discussion 

This paper mainly investigates the relation between 
ArchiMate as a modeling language and ConOps as a natural 
language based document. 

A. Summary of the findings 

In section 3, we formulated the research questions guiding 
our investigation. Here, we show the main findings (F1, F2, F3, 
F4) of this research based on the research questions: 

F1: In section 4 we presented a mapping table based on the 
ArchiMate classification for the Motivation extension, 
"Business Layer" and "Application Layer" in Tables 1 and 2. 
During our investigation we examined a set of ConOps 
documents and tried to establish the connection by 
modeling all the ConOps clauses using ArchiMate. The results 
showed us the extent of coverage, as shown in Table 3. 

To evaluate the mapping we examined two perspectives of 
transformation: the first is modeling ArchiMate from ConOps 
(ConOps perspective), and the second is writing ConOps from 
ArchiMate (ArchiMate perspective). 

For the first part of the evaluation, we examined the 
coverage from the ConOps perspective. While only 44.4% 
of ConOps clauses can be represented by ArchiMate elements 
according to Table 3, we believe that ArchiMate is still a 
suitable tool to facilitate the creation of a ConOps document, 
for two reasons. First, the main challenge of writing the ConOps 
document stems from the lack of specific analysis methods [5]. 
Second, some sections considered out of scope for ArchiMate 
representation are mostly descriptive and do not require further 
analysis, such as clauses 1.1 Identification and 1.2 Document 
Overview. 

For the second part of the evaluation we examined the mapping 
from the ArchiMate perspective (transformation from ArchiMate 
to ConOps). We evaluated the mapping based on the evaluation 
method by [9]; Figure 9 shows the result of the evaluation. The 
evaluation shows that the mapping is complete, not redundant, 
and has no excess. However, there is an overload, which means 
that more than one clause from ConOps can be matched to one 
concept in ArchiMate. The previous two evaluations show that 
it is more effective to model ArchiMate first to create the ConOps 
document. 


Incompleteness: Every ArchiMate concept can be represented by 
ConOps clauses. Result: Complete 

Redundancy: ConOps clauses can't be represented by more than 
one concept in ArchiMate. Result: Not redundant 

Excess: All ArchiMate concepts are present in ConOps clauses. 
Result: No excess 

Overload: More than one clause from ConOps can be matched to 
one concept from ArchiMate. Result: Overload 


Figure 9. Evaluation based on mapping ArchiMate elements to ConOps 
clauses 

F2: In section 4.2, we have shown how to model ArchiMate 
from a ConOps clause by the demonstration in Figure 2. We had 
to depend on ontological analysis to extract the corresponding 
ArchiMate elements because ConOps is a textual document 
dependent on the author; IEEE Std 1362-1998 provides a set of 
guidelines, but no specific method for analysis or clear rules. 
Matching sentences to ArchiMate elements using lead words 
proved to be useful. 

F3: The results of the experiment and the quantitative analysis 
show an increase in accuracy together with a decrease in 
difficulty, which shows that the level of understandability of the 
ArchiMate model derived from ConOps was significantly higher. 
This indicates that a higher level of consensus could also be 
achieved among stakeholders. 

F4: Evaluation of the case study: from the previous 
experimental case study, we conclude that creating ConOps 
from ArchiMate has many benefits. There were, however, some 
shortcomings. During communication, some concepts referred 
to in the ConOps document, such as "Capability" and 
"Component" mentioned in Figure 7, were ambiguous in their 
ArchiMate meaning. 

If ArchiMate meta-model concepts and relations are 
specified using the OWL-DL specification, then the process of 
producing ConOps from ArchiMate can be automated and 
checked for consistency [16]. The process should include using 
NLP (natural language processing), domain ontologies and 
ontology models. 

Based on the previous findings we assert that ArchiMate 
can be used to facilitate the analysis process and to improve 
understandability and therefore communication. Other 
modeling tools have been known to aid in the process of 
creating a ConOps document; using ArchiMate in combination 
with these tools could promote better coverage. 

There have been efforts to describe the advantages of using 
various systems thinking methods and modeling tools by 
papers such as [5]. This paper focuses on the relationship 
between ConOps and ArchiMate specifically, and the proposal 
of further modeling tools or frameworks to create ConOps is 
out of this paper's scope. 

VIII. Conclusion and future work 

In this paper we investigated the relationship between 
ConOps and enterprise architecture, and then explained the 
relationship between ArchiMate concepts and ConOps. 
Creating a ConOps document can itself be challenging because 
there is no exact technique to collect and analyze the data. For 
this reason we suggest the use of ArchiMate as a tool for 
analysis and communication. Our findings indicate 
that using ArchiMate could significantly improve 
understandability and promote better communication. We 
presented a mapping table between concepts of ArchiMate and 
ConOps clauses based on ontology definition; however, there is 
still a need for a complete definition of the concepts found in 
domain-specific ConOps documents. To further ease 
communication and to achieve a better connection between 
ConOps and ArchiMate, we wish to create an accurate 
definition for the most used concepts defined by the ConOps 
document guidelines and a specific domain ontology, and map 
it to ArchiMate concepts. 

As part of our future research activities, we also aim to 
create a tool based on our definition. The tool should aid in the 
process of creating a consistent ConOps document, 
automate the process of identifying concepts based on a set of 
rules using natural language processing and domain ontology, 
and derive and generate the corresponding ArchiMate 
models. 

Abstract — This research was conducted to determine the level of 
information security in an organization and to recommend 
improvements to its information security management. 
The research uses ISO 27002, involving all the 
clauses in the ISO 27002 checklists. Based on the 
analysis results, 13 control objectives and 43 security controls 
were spread across 3 clauses of ISO 27002. From the analysis it 
was concluded that the maturity level of information system 
security governance was 2.51, which means the level of security 
is still at level 2 (planned and tracked) but is approaching 
level 3 (well defined). 

Keywords: Academic Information Security; Security System; 
Maturity Level; ISO 27002; SSE-CMM; 

I. Introduction 

Academic information systems have been widely used by 
almost all universities in Indonesia. They are intended to 
facilitate the delivery of information to learners, teaching staff, 
and administrative personnel. The more interaction there is 
between the system and its users, the more vulnerable the 
system becomes to being infiltrated or damaged by 
irresponsible parties. This raises a new issue in terms of 
security. 

An academic information system, as a tool for managing 
students' academic data, needs to ensure the security, privacy, 
and integrity of the data it processes. In addition, the 
performance of the information system is an important aspect 
that must be considered so that the system can be used 
optimally. 

Security issues gave rise to mechanisms that control 
access to the network in order to protect it from intruders [1]. 
In developing software that supports network forensics, the 
question is how to determine the appropriate method to 
facilitate the processing of log data [2]. For the system to 
continue to run in accordance with its needs and usefulness, 
its performance must be measured through examination. 
For an information system security check to work properly, 
a standard is required. Formally, there is no reference that 
prescribes which standard an organization must use or select 
to carry out security checks of information systems, so an 
organization can use standards as required. 

Information security is a must. The issue is important 
because if information can be accessed by irresponsible 
people, its accuracy will be doubted and it can even become 
misleading. The problems addressed in this research are 
whether the security of the academic information system in 
use is in accordance with the standards, and how ready the 
academic information system is for the application of 
information security standards. A further question is what 
role standardization of information system security plays in 
safeguarding stored information from the various existing 
threats. 

The purpose of this study is to obtain accurate 
measurement results for information security on 
academic information systems and to improve the quality of 
information security in accordance with the ISO 27002 standard. 
A further aim is to determine the maturity level of the security 
systems used in academic information systems. It is expected 
that the results can be used as input when preparing 
measures to improve information security system 
management. 

II. Literature Review 
A. Information Security 

Information security is the preservation of information 
from all possible threats in an attempt to ensure 
business continuity, minimize business risk, and maximize or 
accelerate return on investment and business opportunities 
[3]. 

Information security has some aspects that must be 
understood in order to implement it. The first three of these 
aspects are most commonly known as the C.I.A. 
triangle model, as shown in Figure 1 [4]. 

Figure 1. Aspects of Information Security 

Confidentiality, integrity and availability are basic 
requirements for business information security and provide 
the maintenance requirements of the business [3] [4]. 

• Confidentiality (C): All information must be 
protected according to the degree of privacy of its 
content, with the aim of limiting access to and use by 
only the people for whom it is intended; 

• Integrity (I): All information must be kept in the same 
condition in which it was released by its owners, in 
order to protect it from tampering, whether 
intentional or accidental; 

• Availability (A): All the information generated or 
acquired by an individual or institution should be 
available to its users at the time they need it for 
any purpose. 

Information security is not limited to data stored in IT 
systems; all valuable information, regardless of the way it is 
recorded or stored, needs to be safeguarded. Such data 
includes not only privacy-sensitive information but also 
research data and copyrighted materials. 

Information security is the protection of information from 
a wide range of threats in order to ensure business continuity, 
minimize business risk, and maximize return on investments 
and business opportunities. Information security is achieved 
by implementing a suitable set of controls, including policies, 
processes, procedures, organizational structures and software 
and hardware functions. These controls need to be 
established, implemented, monitored, reviewed, and 
improved, where necessary, to ensure that the specific 
security and business objectives of the organization are met. 
This should be done in conjunction with other business 
management processes. [4] 

B. SSE-CMM 

The Systems Security Engineering Capability Maturity 
Model (SSE-CMM) was developed with the objective of 
advancing security engineering as a defined, mature and 
measurable discipline. The model and its accompanying 
appraisal method are currently available tools for evaluating 
the capability of providers of security engineering products, 
systems, and services as well as for guiding organizations in 
defining and improving their security engineering practices. 

SSE-CMM is the Capability Maturity Model (CMM) for 
System Security Engineering (SSE). CMM is a framework 
for developing processes, such as technical processes, 
both formal and informal. SSE-CMM consists of two parts, 
namely the model for security engineering processes, projects 
and organizations, and the appraisal method for determining 
process maturity. The SSE-CMM contains 11 security 
engineering process areas. The definition of each of the process 
areas below contains a goal for the process area and a set of 
base practices that support the process area. 

• Administer Security Controls 

• Assess Impact 

• Assess Security Risk 

• Assess Threat 

• Assess Vulnerability 

• Build Assurance Argument 

• Coordinate Security 

• Monitor System Security Posture 

• Provide Security Input 

• Specify Security Needs 

• Verify and Validate Security 

The capability maturity levels, from 0 to 5, that represent 
increasing process maturity are: 

• Level 0 indicates that not all base practices are 
performed. 

• Level 1 indicates that all the base practices are performed, 
but informally, meaning that there is no 
documentation, no standards, and the practices are done 
separately. 

• Level 2, planned and tracked, indicates 
commitment to planning process standards. 

• Level 3, well defined, means that standard processes 
have been run in accordance with their definition. 

• Level 4, quantitatively controlled, means 
improved quality through monitoring of every 
process. 

• Level 5, continuously improving, indicates that the 
standard has been perfected and the focus is on adapting 
to changes. 

The SSE-CMM method is applied by assigning each selected 
process area an assessment score between 0 and 5 [5]. 

SSE-CMM describes the essential characteristics of an 
organization's security engineering process that must exist 
to ensure good security engineering. It does not prescribe a 
particular process or sequence, but captures practices 
generally observed in industry. 

The assessment and evaluation of investments made in 
implementing IT deserve consideration. Research indicates 
that companies have begun to recognize this and to carry out 
performance measurement and evaluation [6]. In the process 
of checking the security of information systems, some media 
used for data storage are very vulnerable to damage 
caused by certain parties [8]. 

Information security describes efforts to protect 
computer and non-computer equipment, facilities, data, 
and information from abuse by irresponsible people. This 
definition includes copiers, fax machines, and all types of 
media, including paper documents and smartphones. 
Communicating by smartphone has become a 
daily necessity. In some cases, smartphones may be 
misused for computer crime, from fraud to extortion [5]. 

C. Maturity Level 

One of the tools for measuring the performance of 
information systems is the maturity level model [8]. A maturity 
model for the management and control of information system 
processes is based on an evaluation method with which the 
organization can rate itself from level 0 
(non-existent) to level 5 (optimized). The maturity model is 
intended to determine the existence of problems and how to 
prioritize improvements, as shown in Table 1. 

To identify the extent to which the organization meets 
information security standards, an identification framework 
can be used that is represented as maturity levels, which 
group the capabilities of the company. 


TABLE I. Criteria Assessment Index at Maturity Level 

Range        Description 
0-0.50       Non-Existent 
0.51-1.50    Initial / Ad Hoc 
1.51-2.50    Repeatable but Intuitive 
2.51-3.50    Defined Process 
3.51-4.50    Managed and Measurable 
4.51-5.00    Optimized 
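
The ranges in Table I amount to a simple lookup from a maturity index to a description. The sketch below encodes the table as written; the function name and the rounding to two decimals are choices made for this example, not part of the study.

```python
# Upper bound of each range in Table I, paired with its description.
MATURITY_RANGES = [
    (0.50, "Non-Existent"),
    (1.50, "Initial / Ad Hoc"),
    (2.50, "Repeatable but Intuitive"),
    (3.50, "Defined Process"),
    (4.50, "Managed and Measurable"),
    (5.00, "Optimized"),
]


def classify(index):
    """Return the Table I description for a maturity index between 0 and 5."""
    if not 0 <= index <= 5:
        raise ValueError("maturity index must be between 0 and 5")
    # Table I lists bounds to two decimals, so round before comparing.
    value = round(index, 2)
    for upper, description in MATURITY_RANGES:
        if value <= upper:
            return description
    raise AssertionError("unreachable: 5.00 is the last upper bound")


for score in (0.4, 2.50, 2.51, 4.8):
    print(f"{score:.2f} -> {classify(score)}")
```

Under these ranges an index of 2.51 sits exactly on the boundary between two bands, so a change of 0.01 moves it from one description to the next.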


The ability and maturity of the selected IT processes are 
assessed using maturity levels; the assessment results show 
the maturity level of the existing IT processes. Next, a 
maturity target is determined for each selected IT process. The 
maturity target of each process is the ideal condition to be 
achieved, expressed as the desired maturity level, 
which then becomes the reference for the IT management 
model to be developed. 

Once the maturity level of the current process is set and 
the target process maturity has been determined, the 
gap between the current conditions and the targets is 
analyzed to identify opportunities for optimization. 
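
The gap step described above is a subtraction per process followed by a ranking. The following sketch shows one way to do it; the clause names, the current scores, and the target of 3.00 are hypothetical values invented for the example and are not results from this study.

```python
def gap_analysis(current, target):
    """Return (process, current, target, gap) rows, largest gap first."""
    rows = []
    for process, level in current.items():
        gap = round(target[process] - level, 2)
        rows.append((process, level, target[process], gap))
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical maturity scores for three clauses (not taken from the study).
current = {"Clause A": 2.40, "Clause B": 2.75, "Clause C": 2.38}
target = {"Clause A": 3.00, "Clause B": 3.00, "Clause C": 3.00}

rows = gap_analysis(current, target)
for process, now, goal, gap in rows:
    print(f"{process}: current {now:.2f}, target {goal:.2f}, gap {gap:.2f}")
```

Sorting by the size of the gap gives a first ordering of improvement priorities; in practice the ordering would also reflect the risk attached to each control.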


The measurement technique uses an ordinal scale to sort 
objects from the lowest to the highest; these measurements 
give only the rank order. Measurements were taken directly 
from values that refer to the existing ordering in maturity 
models, as shown in Table 2 [9]. 


TABLE II. Maturity Level 

Level: Description 

0 Non-Existent: The company does not care about the 
importance of information technology being managed well by 
management. 

1 Initial: The company reactively applies and implements 
information technology in response to sudden needs as they 
arise, without prior planning. 

2 Repeatable: The company has a pattern that is repeatedly 
followed in conducting activities related to the management of 
information technology governance, but the pattern has not 
been well defined and formal inconsistencies still occur. 

3 Defined: The company has formal, written standard 
operating procedures that have been communicated to all 
levels of management and employees to be obeyed and 
followed in daily activities. 

4 Managed: The company has a number of indicators or 
quantitative measures that serve as targets and objective 
performance measures for every application of information 
technology. 

5 Optimized: The company has implemented information 
technology governance that refers to "best practice". 

D. ISO 27002 

ISO 27002 is published by the International 
Organization for Standardization (ISO) and the International 
Electrotechnical Commission (IEC). ISO 27002 was 
originally named ISO/IEC 17799, and published in 2000. It 
was updated in 2005, when it was accompanied by the newly 
published ISO 27001. The two standards are intended to be 
used together, with one complementing the other. 

The standards are updated regularly to incorporate 
references to other ISO/IEC security standards such 
as ISO/IEC 27000 and ISO/IEC 27005, and to add 
information security best practices that have emerged since 
previous publications. These include the selection, 
implementation and management of controls based on an 
organization's unique information security risk environment. 

ISO/IEC 27002 is a code of practice - a generic, 
advisory document, not a formal specification such as 
ISO/IEC 27001. It recommends information security 
controls addressing information security control objectives 
arising from risks to the confidentiality, integrity and 
availability of information. Organizations that adopt 
ISO/IEC 27002 must assess their own information risks, 
clarify their control objectives and apply suitable controls (or 
indeed other forms of risk treatment) using the standard for 
guidance. 



https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 





International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


The standard is structured logically around groups of 
related security controls. Many controls could have been put 
in several sections but, to avoid duplication and conflict, 
they were arbitrarily assigned to one and, in some cases, 
cross-referenced from elsewhere. For example, a card- 
access-control system for, say, a computer room or 
archive/vault is both an access control and a physical control 
that involves technology plus the associated management or 
administration and usage procedures and policies. This has 
resulted in a few oddities (such as section 6.2 on mobile 
devices and teleworking being part of section 6 on the 
organization of information security) but it is at least a 
reasonably comprehensive structure. It may not be perfect 
but it is good enough on the whole, as shown in Figure 2 [6]. 

Figure 2. Contents of ISO/IEC 27002:2013 

E. Gap Analysis 

Gap analysis is one tool that can be used to evaluate 
employee performance. Gap analysis is also 
one of the most important steps in the planning stage as 
well as in the job evaluation phase. 

This method is one of the most common methods used 
in the internal management of an institution. 
Literally, "gap" identifies a disparity between one thing and 
another. In general, the performance of a company or 
institution is reflected in the operational systems and 
strategies it uses [10]. The service quality 
gap is defined as the gap between the service that should be 
provided and the consumer's perception of the actual service 
provided. The smaller the gap, the better the quality of 
service. 


F. Guttman Scale 

Guttman scaling was developed by Louis Guttman 
(1944, 1950) and was first used as part of the classic work on 
the American Soldier. 

Guttman scaling is applied to a set of binary questions 
answered by a set of subjects. The goal of the analysis is to 
derive a single dimension that can be used to position both 
the questions and the subjects. The position of the questions 
and subjects on the dimension can then be used to give them 
a numerical value. Guttman scaling is used in social 
psychology and in education [7]. 

The Guttman scale is one of the three major types of 
unidimensional measurement scales. The other two are the 
Likert Scale and the Thurstone Scale. A unidimensional 
measurement scale has only one (“uni”) dimension. In other 
words, it can be represented by a number range, like 0 to 100 
lbs or “Depressed on a scale of 1 to 10”. By giving the 
test, a numerical value can be placed on a topic or factor. 

The scale has YES/NO answers to a set of questions that 
increase in specificity. The idea is that a person will get to a 
certain point and then stop. For example, on a 5-point quiz, 
if a person gets to question 3 and then stops, it implies they 
do not agree with questions 4 and 5. If one person stops at 3, 
another at 1, and another at 5, the three people can be ranked 
along a continuum. 

III. Research Methods 

This section describes how the research was carried out, 
including the materials, the tools, and the sequence of 
steps, arranged systematically and logically so that they can 
serve as clear guidelines for resolving the problems, analyzing 
the results, and handling the difficulties encountered. The 
sequence of problem-solving steps in this research can be seen 
in Figure 3. 


Figure 3. Steps of Research Activities 

In this research, the method used is a qualitative research 
method in which the data are obtained from questionnaires 
distributed to respondents. To prepare the questionnaires, 
the authors made a list of questions based on the ISO 27002 
guidance on implementing information security management, 
covering 3 clauses. The scope of the information system 
security check is set by determining the control objectives 
to be used, in line with the engagement letter agreed 
beforehand. The company needs to select among the existing 
controls by taking into account its organizational needs, how 
the controls are applied, and the risks that arise if those 
controls are not met. Controls are designed to provide 
assurance that managerial actions can ensure that business 
objectives will be achieved and that unwanted events will be 
prevented, detected, and corrected (Sarno, 2009) [12]. 
Table 3 maps the guidelines used to the ISO 27002 clauses. 


TABLE III. Clause of ISO 27002 

Clauses   Descriptions 
9         Access Control 
11        Physical and Environmental Security 
14        System Acquisition, Development, and Maintenance 


The secondary data used in this study were obtained through 
literature studies of books, journals, and proceedings. The 
questionnaire results are then processed using the maturity 
level to calculate the level of information security maturity. 
The questionnaire uses the Guttman scale. A scale of this 
type yields firm answers, such as yes-no, right-wrong, 
ever-never, positive-negative, and so on. 

In this research, each questionnaire item offers two 
choices, Yes and No. In the calculation, the answer Y (Yes) 
is converted to a value of 1, and the answer N (No) is 
converted to a value of 0. 
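
As a minimal sketch of this encoding step (the helper names 
`encode_answer` and `share_of_yes` are illustrative and not from the 
study, which used Microsoft Excel), the Yes/No answers can be mapped 
to 1/0 as follows. How the study scales these values to the 0 to 5 
maturity index is not stated here, so the sketch stops at the plain 
share of Yes answers. 

```python
def encode_answer(answer):
    """Map a Guttman-style Yes/No answer to 1 or 0."""
    normalized = answer.strip().upper()
    if normalized in ("Y", "YES"):
        return 1
    if normalized in ("N", "NO"):
        return 0
    raise ValueError(f"Unexpected answer: {answer!r}")


def share_of_yes(answers):
    """Return the fraction of Yes answers in a list of raw answers."""
    scores = [encode_answer(a) for a in answers]
    return sum(scores) / len(scores)


# Hypothetical answers from one respondent for one control (not study data).
raw_answers = ["Yes", "No", "Y", "N", "yes"]
scores = [encode_answer(a) for a in raw_answers]
fraction = share_of_yes(raw_answers)
print(scores, fraction)  # [1, 0, 1, 0, 1] 0.6
```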

The software used for the maturity level calculation is 
Microsoft Excel. After all questionnaire results are entered 
into the table, the maturity level of each process in each 
clause is calculated for each respondent. Seven respondents 
were selected to fill out the questionnaire in this study, 
as shown in Table 4. 

The analysis and interpretation of the processed data and of 
the interviews with the manager of the academic information 
system form the research findings. Based on the maturity 
level calculation, the gap can be seen and the expected value 
determined, which leads to recommendations for each control 
objective that needs improvement. 


TABLE IV. Respondents 

No   Functional Structure                             Total 
1    Head of Information Technology                   1 
2    Assistant of Information Technology 
     Development of Systems and Applications          1 
3    Executing Administration Information System      2 
4    Senior Executing Data Processing and Reports     1 
5    Programmer                                       2 
     Respondents                                      7 


IV. Result and Analysis 

This section explains the results of the analysis of the 
implementation and the measurement of the maturity level of 
academic information system security, obtained from 
questionnaires and interviews in accordance with the 
ISO/IEC 27002 framework. 

A. Summary of the Maturity Level 

Based on the recapitulation of the questionnaire results, 
the answers were averaged by clause and by respondent to 
obtain the maturity level. The results are as follows: 

1) Maturity level Result Clause 9 : Access Control 
Based on the calculation of maturity level, the value 
obtained for clause 9 on access control is at the Initial / 
Ad Hoc level, with a value of 1.44, which means that the 
current information security of the academic information 
system is not yet in accordance with standard processes and 
should be improved. Information security duties and 
responsibilities should be carried out by all staff who run 
the academic information system. Third parties are not 
allowed to access information they are not authorized for; 
they may only access general data, as shown in Table 5. 
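
The clause value of 1.44 is consistent with a plain average of the 
22 control indices listed in Table V. The following sketch checks 
that arithmetic; the variable names are illustrative, and treating 
the clause value as an unweighted mean is an inference from the 
published numbers rather than a formula stated by the authors. 

```python
# Control indices for clause 9, copied from Table V.
clause9_indices = [
    ("9.1.1", 1.54), ("9.2.1", 1.10), ("9.2.2", 0.89), ("9.2.3", 0.50),
    ("9.2.4", 1.20), ("9.3.1", 0.90), ("9.3.2", 2.10), ("9.3.3", 1.20),
    ("9.4.1", 1.40), ("9.4.2", 1.00), ("9.4.5", 1.20), ("9.4.6", 0.50),
    ("9.4.7", 1.20), ("9.5.1", 1.67), ("9.5.2", 2.67), ("9.5.3", 2.78),
    ("9.5.4", 1.50), ("9.5.5", 1.40), ("9.5.6", 1.75), ("9.6.1", 1.90),
    ("9.6.2", 1.20), ("9.7.1", 2.00),
]

# Assumption: the clause maturity is the unweighted mean of its controls.
values = [index for _, index in clause9_indices]
clause9_maturity = round(sum(values) / len(values), 2)

# Controls with the lowest index are the first candidates for improvement.
weakest = sorted(clause9_indices, key=lambda item: item[1])[:2]

print(clause9_maturity)  # 1.44
print(weakest)           # [('9.2.3', 0.5), ('9.4.6', 0.5)]
```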


TABLE V. Calculation of Clause 9: Access Control 

Control Objective   Description                               Index 
9.1.1               Access control policy                     1.54 
9.2.1               User registration                         1.10 
9.2.2               Privilege or special management           0.89 
9.2.3               User password management                  0.50 
9.2.4               Review of user permissions                1.20 
9.3.1               Use of passwords                          0.90 
9.3.2               Unattended user tools                     2.10 
9.3.3               Clear desk and clear screen policies      1.20 
9.4.1               Network service usage policy              1.40 
9.4.2               User authentication to connect out        1.00 
9.4.5               Separation with the network               1.20 
9.4.6               Control over network connections          0.50 
9.4.7               Control of network routing                1.20 
9.5.1               Safe log-on procedures                    1.67 
9.5.2               User identification and authentication    2.67 
9.5.3               Password management system                2.78 
9.5.4               Use of system utilities                   1.50 
9.5.5               Session time-out                          1.40 
9.5.6               Connection timeout                        1.75 
9.6.1               Information access restrictions           1.90 
9.6.2               Isolate sensitive systems                 1.20 
9.7.1               Communication and computerized moving     2.00 

The maturity level calculation results for clause 9 on access 
control can be represented in graphical form, as shown in 
Figure 4. 

Figure 4. Maturity Level of Clause 9: Access Control 


2) Maturity level Result Clause 11 : Physical and 
Environmental Security 

Based on the calculation of maturity level, the value 
obtained for clause 11 on physical and environmental security 
is at the Repeatable but Intuitive level, with a value of 
2.47, which means that the current information security of 
the academic information system should be developed to a 
better stage. Until now there has been no information 
security audit of the academic information system, but the 
policies issued by management are distributed evenly to all 
units. Important records and information are protected by 
the system to avoid damage and loss, as shown in Table 6. 


TABLE VI. Calculation of Clause 11 : Physical and 
Environmental Security 

Control Objective   Description                                           Index 
11.1.1              Physical security restrictions                        4.14 
11.1.2              Physical in control                                   2.1 
11.1.3              Security office, space, and amenities                 3.23 
11.1.4              Protection against external attacks and 
                    environmental threats                                 1.1 
11.1.5              Working in a safe area                                3 
11.1.6              Public access, shipping area and drop of goods        3.5 
11.2.1              Placement of equipment and protection                 4.3 
11.2.2              Supporting utilities                                  2.34 
11.2.3              Security of wiring                                    1.71 
11.2.4              Equipment maintenance                                 1.56 
11.2.5              Safety equipment outside the workplace that is 
                    not hinted                                            2.41 
11.2.6              Security of disposal or re-use of equipment           2.5 
11.2.7              Right transfer of equipment                           0.25 

The maturity level calculation results for clause 11 on 
physical and environmental security can be represented in 
graphical form, as shown in Figure 5. 

Figure 5. Maturity Level of Clause 11: Physical and Environmental 
Security 


3) Maturity level Result Clause 14 : System Acquisition, 
Development, and Maintenance 

Based on the calculation of maturity level, the value 
obtained for clause 14 on system acquisition, development, 
and maintenance is at the Managed and Measurable level, with 
a value of 3.63, which means that information security is 
standardized and must be documented and then communicated 
through training. The academic information system is 
interactive: at every validation, the system issues messages 
related to user-initiated activities. All information 
systems are designed and built by the Information Technology 
Division without the involvement of outsiders or 
outsourcing, as shown in Table 7. 


TABLE VII. Calculation of Clause 14: System Acquisition, 
Development, and Maintenance 

Control Objective   Description                                         Index 
14.1.1              Incorporate information security in the business 
                    continuity management process                       3.85 
14.2.1              Validate input data                                 3.5 
14.2.2              Controls for internal processing                    3.9 
14.2.4              Validation of output data                           3.4 
14.5.1              Additional control procedures                       3.68 
14.5.3              Restrictions on software package changes            3.51 
14.5.4              Weakness of information                             3.42 
14.6.1              Control of technical weakness (Vulnerability)       3.75 

The maturity level calculation results for clause 14 on 
system acquisition, development, and maintenance can be 
represented in graphical form. The result of the maturity 
level calculation of clause 14 can be seen in Figure 6. 

Figure 6. Maturity Level of Clause 14: System Acquisition, Development, 
And Maintenance 


B. The Gap of Value Maturity Level 

Based on the calculation, the current information security 
maturity level of the academic information system is 2.51 
(defined), and the expected maturity level is 5 (optimized). 
Level five (optimized) is chosen as the target because it 
reflects the organization's readiness in security policies, 
procedures and processes, and information security access 
control, as can be seen in Table 9 below. 


TABLE IX. Result of Calculation Maturity 

Clause   Description                           Maturity              Gap 
                                               Current   Expected 
9        Access Control                        1.44      5           3.56 
11       Physical and Environmental Security   2.47      5           2.53 
14       System Acquisition, Development, 
         and Maintenance                       3.63      5           1.37 
         Average                                                     2.49 


The maturity value is obtained from the average of the 
respondents' answers for each clause of the ISO 27002 
standard. Table 8 shows the results of the questionnaire 
calculation used to obtain the maturity level of the academic 
information system. 


TABLE VIII. Result of Calculation Maturity 

Clauses   Descriptions                          Index   Level 
9         Access Control                        1.44    1 
11        Physical and Environmental Security   2.47    2 
14        System Acquisition, Development, 
          and Maintenance                       3.63    3 
          Average maturity level                2.51    3 


The average value of information security control for the 
academic information system is 2.51. From this value, it can 
be concluded that information security is at level three, 
well defined, meaning that standard processes are run in 
accordance with their definition. 

Based on the results in Table 8, the graph for each clause 
is shown in Figure 7 below. 



Figure 7. Measurements graphs in maturity level (horizontal axis: 
clauses 9, 11, and 14) 


Based on Table 9, the gap between the current and expected 
conditions is 3.56 for clause 9, 2.53 for clause 11, and 
1.37 for clause 14. 


After the gap for each clause is obtained, the gaps are 
summed and averaged to obtain the overall gap. The overall 
gap between the current maturity and the expected maturity 
is 2.49. Because this gap is fairly large, each control 
needs adjustment. Recommendations will be given for each 
control, with the focus on improving weak controls. The 
current and expected maturity levels are compared in 
Figure 8. 
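
The figures in Table IX can be reproduced with a few lines of 
arithmetic. The sketch below is illustrative only (the study used 
Microsoft Excel, and the variable names are not from the paper); it 
takes the gap as the expected level minus the current level and then 
averages across the three clauses. 

```python
# Current maturity per clause, copied from Table IX.
current = {"9": 1.44, "11": 2.47, "14": 3.63}
expected_level = 5  # target level (optimized) stated in the text

# Gap per clause = expected level - current level.
gaps = {clause: round(expected_level - value, 2)
        for clause, value in current.items()}

# Overall figures are plain averages over the three clauses.
average_maturity = round(sum(current.values()) / len(current), 2)
average_gap = round(sum(gaps.values()) / len(gaps), 2)

# The clause with the largest gap is the first candidate for improvement.
largest_gap_clause = max(gaps, key=gaps.get)

print(gaps)                # {'9': 3.56, '11': 2.53, '14': 1.37}
print(average_maturity)    # 2.51
print(average_gap)         # 2.49
print(largest_gap_clause)  # 9
```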



Figure 8. Result of Gap Analysis (legend: Maturity Level Current, 
Maturity Level Expected) 

As shown in Figure 8, the current maturity level is 
represented by the blue line, while the expected maturity 
level (target level) is represented by the red line. The 
expected maturity level of five means continuous 
improvement, which signifies that the standard has been 
perfected and the focus is on adapting to change. This 
target level was selected based on the analysis results, in 
which the control objective values are spread across the 
range between 1 and 3. 

The lowest maturity level is in clause 9, at 1.44, which 
gives the highest gap, 3.56. The highest maturity level is 
in clause 14, at 3.63, which gives the lowest gap, 1.37. 
Thus, the higher the gap for a clause, the more likely the 
clause is to suffer a security breach, and the lower the gap, 
the less likely the clause is to have security problems. 

V. Recommendations and Improvement Strategy 

After analyzing and evaluating information security, the 
researchers found several conditions that already comply 
with the ISO 27002 security controls. These conditions are: 

• There are rules about information security 
responsibilities in the employees' employment contracts. 

• There is perimeter security to protect areas which 
contain information processing facilities. 

• Business requirements for access control have been 
determined. 

• Management is responsible for information security 
incident management. 

The conditions that still need to be improved are: 

• The confidentiality agreement has not been described 
in specific detail. 

• There is no training related to information security, 
such as the criteria for good passwords or training in 
anticipation of a virus attack. 

• Access rights are not reviewed and renewed on a 
regular basis. Renewal of permission is not required 
on a regular basis. 

• Many policies and procedures have not been 
documented, and some actions in the organization are 
carried out spontaneously and without any formal 
rules. 

• Access rights need to be re-examined and updated 
when staff are transferred or promoted, in accordance 
with their respective access rights. 

• Every employee, contractor, or third party should 
return all the company's assets used for work, 
depending on the contract, when they leave the 
company or move to another unit. 

Defects in information system equipment occur due to a lack 
of maintenance by the organization, a lack of management 
capacity, and poorly coordinated equipment handling. 


VI. Conclusions 

Based on the security analysis of the information system, 
this research used 13 control objectives and 43 security 
controls spread across 3 clauses of ISO 27002 in the 
information security audit process. 

The SSE-CMM measures the maturity levels of the relevant 
security processes that an organization implements to achieve 
the intended capability maturity levels of the security PAs. 
Application of the SSE-CMM is a straightforward analysis 
of existing processes to determine which base processes have 
been met and the maturity levels they have achieved. The 
same process can help an organization determine which 
security engineering processes it may need but does not 
currently have in practice. 

The measured maturity level of the academic information 
system is level 3 (well defined). The average value obtained 
from the questionnaire for all of the clauses is 2.51 on a 
range of 0 to 5, and the gap between the current and the 
expected security conditions is 2.49. From this value, it 
can be concluded that information security is at level 
three, the defined process level. 

A security level at the defined process level means that 
standard processes run in accordance with their definition. 
In other words, based on the vision, mission, objectives, 
and direction of the organization's development, procedures 
are standardized, documented, and communicated through 
training, but the implementation is left to the team to 
follow the process, so that deviations are known and the 
procedures are refined to formalize existing practice. 

Abstract - This paper describes the realization and study 
of a neural network for a generalized function with real 
inputs and a binary output (0 or 1). The neural network has 
been implemented with three different tools: the neural 
network simulator NeuroPh, the logic programming language 
Visual Prolog, and the object-oriented programming language 
Java. The aim is to explore the neural network realization 
capabilities of the three tools. For this purpose, a 
function with real inputs and a binary output (0 or 1) is 
selected, whose values the neural network is trained to 
predict. The results obtained allow identifying the 
strengths and weaknesses of the three realized neural 
networks, as well as of the environments through which they 
are realized and tested. 

Keywords - neural networks, simulators, logic programming 
language, Visual Prolog, Java 

I. Introduction 

There are various methods and algorithms for 
modeling artificial neural networks (Nachev & al., 2009, 
2011) [5,6]. 

Simulators, e.g. NeuroPh and Joone, are one of these 
methods. They offer a number of advantages, such as a 
convenient interface and tools for easily building models of 
various types of neural networks (Zdravkova & Nenkov, 
2016 a, b). They are free environments available under the 
GNU Lesser General Public License (LGPL) and are easy to 
learn. 

The simulator selected for implementing the neural 
network in this study is NeuroPh, a Java-based, 
object-oriented simulator. NeuroPh is also open source and 
supports many different neural network architectures. It is 
a lightweight framework for simulating neural networks and 
can serve as a basis for developing standard types of 
neural network architectures. It contains a well-designed 
open-source library and a small number of core classes 
that correspond to basic concepts in neural networks. 
It also has a good graphical editor for quickly building 
Java-based neural network components. 


Neural networks, with their remarkable ability to derive 
meaning from complicated and imprecise data, can be 
used to extract patterns and detect trends that are too 
complex to be noticed by humans or other computer 
techniques. A trained neural network can be considered an 
"expert" in the category of information it has been given 
to analyze. This expert can then be used to make 
predictions for new situations and to answer "what if" 
questions. 

Neural networks take a different approach to problem 
solving than conventional computers. Conventional 
computers use an algorithmic approach, i.e. the computer 
follows a set of instructions to solve a problem. 
The computer cannot solve the problem unless the specific 
steps it needs to follow are known. This limits the 
problem-solving ability of conventional computers to 
problems that are already understood and have a known 
solution. But computers would be much more useful if they 
could do things that people do not know exactly how to do. 

Neural networks process information in a way similar 
to the human brain. The network consists of a large 
number of highly interconnected elements that work in 
parallel to solve a specific problem. Neural networks learn 
by example. They are trained rather than explicitly 
programmed to perform a specific task. Examples must be 
selected carefully, otherwise valuable time is lost or, even 
worse, the network may not function properly. The downside 
is that, since the network works out for itself how to solve 
the problem, its actions cannot be predicted [1,2,4]. 

The function selected for the study has real numbers as 
input data. Such data are also obtained from sensors. 

II. Methodology 

In order to test neural network implementations 
effectively, it is necessary to create a more complex, 
generalized function. It will be implemented as a neural 
network in the NeuroPh simulator, the Visual Prolog logic 
development environment, and the object-oriented Java 
programming language. The test results show the link 
between the selected way of coding the neural network and 
its performance. They also show the advantages and 
disadvantages of each environment for realizing a neural 
network, and how this influences its learning and results. 

The selected function has two inputs, which are real 
numbers, and one output, which is an integer (0 or 1). The 
data used for training are shown in Table I. 

The first implementation uses the NeuroPh neural network 
simulator. It is a Java-based simulator with a good 
interface and a variety of options for building and training 
neural networks. In this case, the neural network needs 
two inputs and one output. The function is not linear, so a 
hidden layer is also needed. The architecture of the neural 
network looks like the following: 



Figure 1. Neural network for generalized function in NeuroPh 
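
To make the described architecture concrete, the following is a 
minimal sketch of the forward pass of a network with two real inputs, 
one hidden layer, and one output that is thresholded to 0 or 1. It is 
written in plain Python rather than with NeuroPh, and the class name 
`TinyMLP`, the hidden layer size of 3, the sigmoid activation, the 
random initial weights, and the 0.5 threshold are all assumptions made 
for illustration; the text does not state these details. The network 
is untrained, so its predictions are not meaningful. 

```python
import math
import random


def sigmoid(x):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """2 real inputs -> one hidden sigmoid layer -> 1 sigmoid output."""

    def __init__(self, n_hidden=3, seed=0):
        # n_hidden=3 is an assumption; the text does not give the layer size.
        rng = random.Random(seed)
        self.w_hidden = [[rng.uniform(-1, 1) for _ in range(2)]
                         for _ in range(n_hidden)]
        self.b_hidden = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b_out = rng.uniform(-1, 1)

    def forward(self, x1, x2):
        """Return the raw network output, a value between 0 and 1."""
        hidden = [sigmoid(w[0] * x1 + w[1] * x2 + b)
                  for w, b in zip(self.w_hidden, self.b_hidden)]
        total = sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out
        return sigmoid(total)

    def predict(self, x1, x2, threshold=0.5):
        """Threshold the raw output to obtain the binary label 0 or 1."""
        return 1 if self.forward(x1, x2) >= threshold else 0


net = TinyMLP()
raw = net.forward(-0.5982, 0.9870)    # first row of Table I
label = net.predict(-0.5982, 0.9870)  # 0 or 1, arbitrary before training
```

The hidden layer is what lets such a network represent a boundary that 
is not a single straight line; without it the model reduces to a 
single sigmoid unit. 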

We also create a DataSet that is used to train and test the 
neural network. The set contains the data presented 
in Table I. 


Table I: Data Set 

Input 1   Input 2   Output 
-0.5982    0.9870   1.0 
-0.2019    0.6210   1.0 
 0.4715    0.4822   1.0 
-0.0982    0.5876   1.0 
-0.3566    0.6371   1.0 
 0.6388    0.4211   1.0 
 0.6298    0.2815   1.0 
-0.4622    0.6166   1.0 
-0.0733    0.5582   1.0 
-0.5541    0.5125   1.0 
-0.4376    0.8781   1.0 
-0.2224    0.8885   1.0 
 0.0935    0.6731   1.0 
 0.5317    0.5437   1.0 
 0.4021    0.5164   1.0 
 0.4756    0.6506   1.0 
-0.2338    0.6364   1.0 
-0.3158    0.7503   1.0 
-0.4735    0.6385   1.0 
 0.5924    0.8926   1.0 
-0.2261    0.7979   1.0 
-0.4400    0.5210   1.0 
-0.5465    0.7458   1.0 
 0.4640    0.5107   1.0 
-0.1519    0.8122   1.0 
 0.4854    0.8202   1.0 
 0.3473    0.7081   1.0 
 0.4390    0.6282   1.0 
-0.2142    0.6436   1.0 
 0.5738    0.6371   1.0 
 0.3872    0.5858   1.0 
 0.3204    0.5353   1.0 
-0.2078    0.6513   1.0 
-0.1865    0.8175   1.0 
 0.2475    0.3908   1.0 
 0.6605    0.8992   1.0 
-0.2866    0.7338   1.0 
-0.3259    0.3987   1.0 
-0.2520    0.6736   1.0 
 0.3726    0.4979   1.0 
-0.2910    1.0437   1.0 
-0.3047    0.8686   1.0 
-0.2139    1.0932   1.0 
-0.3683    0.7564   1.0 
-0.4693    0.8878   1.0 
 0.3935    0.7798   1.0 
-0.4564    0.8052   1.0 
 0.5113    0.7661   1.0 
 0.2255    0.4645   1.0 
 0.0146    0.4019   1.0 
-0.1917    0.8094   1.0 
 0.3832    0.7560   1.0 
 0.4979    0.6133   1.0 
 0.3534    0.7732   1.0 
-0.3472    0.7018   1.0 
 0.5838    0.7636   1.0 
-0.1373    0.7125   1.0 
 0.3883    0.4498   1.0 
-0.5317    0.6193   1.0 
-0.1168    0.8785   1.0 
 0.5434    0.4117   1.0 
-0.4540    0.6651   1.0 
-0.2191    0.8348   1.0 
 0.3049    0.9803   1.0 
 0.6568    0.7577   1.0 
 0.6142    0.7504   1.0 
-0.4581    0.7797   1.0 
-0.2162    0.8863   1.0 
-0.2602    0.8101   1.0 
 0.3188    0.8452   1.0 
-0.2373    0.8018   1.0 
 0.5831    0.7771   1.0 
 0.0284    0.7579   1.0 
-0.4184    0.6804   1.0 
 0.6741    0.6025   1.0 
-0.2528    0.7053   1.0 
 0.5161    0.6209   1.0 
 0.2039    0.9164   1.0 
-0.1721    1.0088   1.0 
 0.2727    0.2935   1.0 
 0.0763    0.5622   1.0 
-0.3665    0.6483   1.0 
 0.4429    0.8009   1.0 
-0.1998    0.5430   1.0 
-0.5408    0.6529   1.0 
-0.0706    1.0030   1.0 
 0.5072    0.3505   1.0 
-0.0605    0.6298   1.0 
 0.2153    0.6026   1.0 
 0.4681    0.8718   1.0 
-0.2989    0.7367   1.0 
 0.8613    0.4729   1.0 
 0.7012    0.7457   1.0 
-0.1134    0.6007   1.0 
 0.3123    0.9076   1.0 
-0.1217    0.8411   1.0 
 0.3687    0.3705   1.0 
 0.5731    0.4095   1.0 
-0.2584    0.6719   1.0 
 0.3094    0.5082   1.0 
 0.4332    0.7702   1.0 
-0.3045    0.5782   1.0 
 0.4428    0.5802   1.0 
-0.1944    0.8988   1.0 
-0.0611    0.7418   1.0 
 0.0762    0.3539   1.0 
 0.8583    0.9582   1.0 
 0.3704    0.7234   1.0 
 0.5148    0.7620   1.0 
 0.4313    0.5426   1.0 
 0.4229    0.6524   1.0 
 0.2982    0.9345   1.0 
 0.3713    0.7009   1.0 
-0.5153    0.7647   1.0 
 0.3853    0.6553   1.0 
-0.3483    0.5053   1.0 
 0.6851    0.7807   1.0 
-0.3653    0.4570   1.0 
-0.4090    0.7423   1.0 
 0.4357    0.4469   1.0 
 0.2689    0.4456   1.0 
-0.4925    1.0144   1.0 
 0.0762    0.6380   1.0 
 0.4923    0.4688   1.0 
-0.4025    0.7130   1.0 
 0.0510    0.1609   0.0 
-0.7481    0.0890   0.0 
-0.7729    0.2632   0.0 
 0.2184    0.1271   0.0 
 0.3727    0.4966   0.0 
-0.6293    0.6320   0.0 
-0.4331    0.1448   0.0 
-0.8415   -0.1913   0.0 
 0.4753    0.2248   0.0 
 0.3208    0.3272   0.0 
 0.3206    0.3341   0.0 
-0.8908    0.4117   0.0 
 0.1785    0.4469   0.0 
 0.3156    0.3885   0.0 
 0.5578    0.4727   0.0 
 0.0319    0.0122   0.0 
 0.2509    0.3072   0.0 
 0.2357    0.2249   0.0 
-0.0724    0.3338   0.0 
 0.5044    0.0805   0.0 
-0.6322    0.4455   0.0 
-0.7678    0.2361   0.0 
-0.7002    0.2104   0.0 
-0.6471    0.1592   0.0 
-0.7674    0.0926   0.0 
-0.5179    0.0329   0.0 
 0.1752    0.3453   0.0 
-0.6803    0.4761   0.0 
 0.0160    0.3217   0.0 
-0.7148    0.5142   0.0 
 0.0784    0.3228   0.0 
-0.8087    0.4704   0.0 
-0.8421    0.0929   0.0 
-0.9859    0.4831   0.0 
 0.2910    0.3428   0.0 
 0.2432    0.5149   0.0 
-0.6010    0.0506   0.0 
-1.2465    0.4592   0.0 
-0.8277    0.3619   0.0 
-0.6212   -0.1091   0.0 
-0.7058    0.6591   0.0 
 0.0672    0.6057   0.0 
 0.3051    0.4742   0.0 
 0.6079    0.3936   0.0 
-0.7894    0.1759   0.0 
-0.5312    0.4265   0.0 
 0.2520    0.1703   0.0 
-0.5788    0.2655   0.0 
-0.8318    0.5445   0.0 
-0.6986    0.3857   0.0 
-0.7364    0.1186   0.0 
-0.9350    0.1137   0.0 
 0.4396    0.4143   0.0 
-0.5469    0.2496   0.0 
-0.0841    0.3652   0.0 
 0.3221    0.6909   0.0 
 0.1076    0.5795   0.0 
-0.7186    0.2565   0.0 
-0.8788    0.4506   0.0 
-0.6985    0.9505   0.0 
 0.3976    0.1181   0.0 
-0.5045    0.5720   0.0 
 0.2502    0.3978   0.0 
 0.6171    0.1019   0.0 
 0.3183    0.0879   0.0 
-0.5745    0.1862   0.0 
 0.0976    0.5518   0.0 
 0.4845    0.3537   0.0 
 0.5240    0.4662   0.0 
-0.7814   -0.0753   0.0 
-0.4970    0.5995   0.0 
-0.9698    0.4662   0.0 
 0.4354    0.1219   0.0 
-0.6794    0.3075   0.0 
-0.6253    0.0710   0.0 
-0.0232    0.4044   0.0 
 0.2320    0.7107   0.0 
 0.0938    0.4667   0.0 
 0.1423    0.1790   0.0 
-0.6169    0.2551   0.0 
 0.2364    0.5154   0.0 
 0.3891    0.4043   0.0 
-0.9518   -0.0377   0.0 
 0.2409    0.7195   0.0 
 0.1245    0.4518   0.0 
-0.6057    0.2691   0.0 
-0.7140    0.3087   0.0 
 0.3101    0.3468   0.0 
 0.1802    0.4620   0.0 
-0.4266    0.6472   0.0 
 0.0614    0.3249   0.0 
 0.0774    0.3218   0.0 
 0.4281    0.1345   0.0 
-0.8025    0.6688   0.0 
 0.4014    0.4252   0.0 
 0.3708    0.2641   0.0 
-0.8077    0.4149   0.0 
 0.5016    0.2393   0.0 
 0.5824    0.2284   0.0 
-0.5914    0.3023   0.0 
-0.8704    0.2694   0.0 
-0.7209    0.1968   0.0 
 0.2778    0.2179   0.0 
 0.3324    0.2735   0.0 
-0.1409    0.3925   0.0 
-0.5976    0.1479   0.0 
-0.8558    0.1451   0.0 
-0.8891    0.2690   0.0 
 0.2135    0.4361   0.0 
-0.5347    0.5790   0.0 
 0.3169    0.3971   0.0 
-0.6812    0.0421   0.0 
-0.9759    0.4596   0.0 
 0.4146    0.2714   0.0 
 0.3275    0.3678   0.0 
-0.9321    0.0936   0.0 
 0.5840    0.4715   0.0 
-0.4444    0.2301   0.0 
 0.2911    0.1937   0.0 
-0.5108    0.4150   0.0 
-0.9660    0.1793   0.0 
 0.1874    0.2975   0.0 
 0.1797    0.4518   0.0 
-0.7269    0.3573   0.0 
-0.5434    0.41011  0.0 


After training and testing the network, we get the 
following result for the average error: 

Total Mean Square Error: 0.3828746868257739 

The complete result of the neural network test is 
shown in Table II. 
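
The Error column of Table II is consistent with the network output 
minus the desired output. The short sketch below reproduces that 
column for the first four rows; the variable names are illustrative, 
and it does not attempt to reproduce the simulator's Total Mean 
Square Error, whose exact formula is not given in the text. 

```python
# (network output, desired output) pairs from the first rows of Table II.
rows = [(0.1358, 1.0), (0.1356, 1.0), (0.1353, 1.0), (0.1356, 1.0)]

# Per-row error, taken here as output minus desired output.
errors = [round(output - desired, 4) for output, desired in rows]

# Plain mean of squared errors over these four rows only (illustrative).
mean_squared = sum(e * e for e in errors) / len(errors)

print(errors)  # [-0.8642, -0.8644, -0.8647, -0.8644]
```

An output near 0.14 for every input, against a desired value of 1, 
suggests that at this stage the network returns almost the same value 
regardless of the input. 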


Table II: Results of a neural network in NeuroPh 

Input 1   Input 2   Output   Desired output   Error 
-0.5982   0.987     0.1358   1                -0.8642 
-0.2019   0.621     0.1356   1                -0.8644 
 0.4715   0.4822    0.1353   1                -0.8647 
-0.0982   0.5876    0.1356   1                -0.8644 
-0.3566   0.6371 

After obtaining the results from the NeuroPh simulator, the 
neural network is implemented in the logic programming 
language Visual Prolog. 

A class for the neural network is created. It is made up 
of three files: 

Network.cl - the class declaration. 

Network.i - the class interface. 

Network.pro - the class implementation. [3,7] 

In the dialog that uses the network, it is set up with the 
following Prolog code: 

onPushButtonClick( _Source) = button:: default Action: 
-XOR = network: :new(), 

XOR:setExamples( [ network::e(-0.5982,0.9870,1.0), 
network:: e(-0.2019,0.6210,1.0), 
network:: e(0.4715,0.4822,1.0), 
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network: :e(-0.0982,0.5876,1.0), 
network: :e(-0.3566,0.6371,1.0), 
network: :e(0.6388,0.4211,1.0), 
network: :e(0.6298,0.2815,1.0), 
network: :e(-0.4622,0.6166,1.0), 
network: :e(-0.0733,0.5582,1.0), 
network: :e(-0.5541,0.5125,1.0), 
netwo rk::e(-0.4376,0.8781,1.0), 
network: :e(-0.2224,0.8885,1.0), 
network: :e(0.0935,0.6731,1.0), 
netwo rk: :e(0.5317,0.5437,1.0), 
network:: e(0.4021,0.5164,1.0), 
network: :e(0.4756,0.6506,1.0), 
network: :e(-0.2338,0.6364,1.0), 
network: :e(-0.3158,0.7503,1.0), 
network: :e(-0.4735,0.6385,1.0), 
network: :e(0.5924,0.8926,1.0), 
network:: e(-0.2261,0.7979,1.0), 
network: :e(-0.4400,0.5210,1.0), 
network: :e(-0.5465,0.7458,1.0), 
network: :e(0.4640,0.5107,1.0), 
network: :e(-0.1519,0.8122,1.0), 
network: :e(0.4854,0.8202,1.0), 
network: :e(0.3473,0.7081,1.0), 
network: :e(0.4390,0.6282,1.0), 
network: :e(-0.2142,0.6436,1.0), 
netwo rk::e(0.5738,0.6371,1.0), 
network: :e(0.3872,0.5858,1.0), 
network: :e(0.3204,0.5353,1.0), 
netwo rk: :e(-0.2078,0.6513,1.0), 
network: :e(-0.1865,0.8175,1.0), 
network: :e(0.2475,0.3908,1.0), 
network: :e(0.6605,0.8992,1.0), 
network: :e(-0.2866,0.7338,1.0), 
network: :e(-0.3259,0.3987,1.0), 
network: :e(-0.2520,0.6736,1.0), 
network: :e(0.3726,0.4979,1.0), 
network: :e(-0.2910,1.0437,1.0), 
network: :e(-0.3047,0.8686,1.0), 
network: :e(-0.2139,1.0932,1.0), 
network: :e(-0.3683,0.7564,1.0), 
network: :e(-0.4693,0.8878,1.0), 
network: :e(0.3935,0.7798,1.0), 
network: :e(-0.4564,0.8052,1.0), 
network: :e(0.5113,0.7661,1.0), 
network: :e(0.2255,0.4645,1.0), 
network: :e(0.0146,0.4019,1.0), 
network: :e(-0.1917,0.8094,1.0), 
network: :e(0.3832,0.7560,1.0), 
network: :e(0.4979,0.6133,1.0), 
network: :e(0.3534,0.7732,1.0), 
network: :e(-0.3472,0.7018,1.0), 
network: :e(0.5838,0.7636,1.0), 
network: :e(-0.1373,0.7125,1.0), 
network: :e(0.3883,0.4498,1.0), 
network: :e(-0.5317,0.6193,1.0), 
network: :e(-0.1168,0.8785,1.0), 
network: :e(0.5434,0.4117,1.0), 
network: :e(-0.4540,0.6651,1.0), 
network: :e(-0.2191,0.8348,1.0), 
network: :e(0.3049,0.9803,1.0), 
network: :e(0.6568,0.7577,1.0), 


network: :e(0.6142,0.7504,1.0), 
network: :e(-0.4581,0.7797,1.0), 
network: :e(-0.2162,0.8863,1.0), 
network:: e(-0.2602,0.8101,1.0), 
network:: e(0.3188,0.8452,1.0), 
network:: e(-0.2373,0.8018,1.0), 
network: :e(0.5831,0.7771,1.0), 
network:: e(0.0284,0.7579,1.0), 
network::e(-0.4184,0.6804,1.0),
network:: e(0.6741,0.6025,1.0), 
network: :e(-0.2528,0.7053,1.0), 
network: :e(0.5161,0.6209,1.0), 
network:: e(0.2039,0.9164,1.0), 
network: :e(-0.1721,1.0088,1.0), 
network: :e(0.2727,0.2935,1.0), 
network:: e(0.0763,0.5622,1.0), 
network: :e(-0.3665,0.6483,1.0), 
network:: e(0.4429,0.8009,1.0), 
network: :e(-0.1998,0.5430,1.0), 
network: :e(-0.5408,0.6529,1.0), 
network: :e(-0.0706,1.0030,1.0), 
network: :e(0.5072,0.3505,1.0), 
network: :e(-0.0605,0.6298,1.0), 
network:: e(0.2153,0.6026,1.0), 
network: :e(0.4681,0.8718,1.0), 
network: :e(-0.2989,0.7367,1.0), 
network:: e(0.8613,0.4729,1.0), 
network: :e(0.7012,0.7457,1.0), 
network: :e(-0.1134,0.6007,1.0), 
network: :e(0.3123,0.9076,1.0), 
network::e(-0.1217,0.8411,1.0),
network: :e(0.3687,0.3705,1.0), 
network: :e(0.5731,0.4095,1.0), 
network: :e(-0.2584,0.6719,1.0), 
network: :e(0.3094,0.5082,1.0), 
network: :e(0.4332,0.7702,1.0), 
network:: e(-0.3045,0.5782,1.0), 
network:: e(0.4428,0.5802,1.0), 
network: :e(-0.1944,0.8988,1.0), 
network: :e(-0.0611,0.7418,1.0), 
network: :e(0.0762,0.3539,1.0), 
network:: e(0.8583,0.9582,1.0), 
network: :e(0.3704,0.7234,1.0), 
network: :e(0.5148,0.7620,1.0), 
network: :e(0.4313,0.5426,1.0), 
network: :e(0.4229,0.6524,1.0), 
network:: e(0.2982,0.9345,1.0), 
network:: e(0.3713,0.7009,1.0), 
network: :e(-0.5153,0.7647,1.0), 
network:: e(0.3853,0.6553,1.0), 
network:: e(-0.3483,0.5053,1.0), 
network: :e(0.6851,0.7807,1.0), 
network: :e(-0.3653,0.4570,1.0), 
network: :e(-0.4090,0.7423,1.0), 
network: :e(0.4357,0.4469,1.0), 
network: :e(0.2689,0.4456,1.0), 
network: :e(-0.4925,1.0144,1.0), 
network: :e(0.0762,0.6380,1.0), 
network:: e(0.4923,0.4688,1.0), 
network: :e(-0.4025,0.7130,1.0), 
network:: e(0.0510,0.1609,0.0), 
network: :e(-0.7481,0.0890,0.0), 
network::e(-0.7729,0.2632,0.0),
network: :e(0.2184,0.1271,0.0), 
network::e(0.3727,0.4966,0.0),
network: :e(-0.6293,0.6320,0.0), 
network: :e(-0.4331,0.1448,0.0), 
network::e(-0.8415,-0.1913,0.0),
network: :e(0.4753,0.2248,0.0), 
network: :e(0.3208,0.3272,0.0), 
network: :e(0.3206,0.3341,0.0), 
network: :e(-0.8908,0.4117,0.0), 
network: :e(0.1785,0.4469,0.0), 
network: :e(0.3156,0.3885,0.0), 
network: :e(0.5578,0.4727,0.0), 
network: :e(0.0319,0.0122,0.0), 
network: :e(0.2509,0.3072,0.0), 
network: :e(0.2357,0.2249,0.0), 
network: :e(-0.0724,0.3338,0.0), 
network: :e(0.5044,0.0805,0.0), 
network: :e(-0.6322,0.4455,0.0), 
network: :e(-0.7678,0.2361,0.0), 
network: :e(-0.7002,0.2104,0.0), 
network: :e(-0.6471,0.1592,0.0), 
network: :e(-0.7674,0.0926,0.0), 
network: :e(-0.5179,0.0329,0.0), 
network: :e(0.1752,0.3453,0.0), 
network: :e(-0.6803,0.4761,0.0), 
network: :e(0.0160,0.3217,0.0), 
network: :e(-0.7148,0.5142,0.0), 
network: :e(0.0784,0.3228,0.0), 
network: :e(-0.8087,0.4704,0.0), 
network: :e(-0.8421,0.0929,0.0), 
network: :e(-0.9859,0.4831,0.0), 
network: :e(0.2910,0.3428,0.0), 
network: :e(0.2432,0.5149,0.0), 
network: :e(-0.6010,0.0506,0.0), 
network::e(-1.2465,0.4592,0.0),
network: :e(-0.8277,0.3619,0.0), 
network::e(-0.6212,-0.1091,0.0),
network: :e(-0.7058,0.6591,0.0), 
network: :e(0.0672,0.6057,0.0), 
network: :e(0.3051,0.4742,0.0), 
network: :e(0.6079,0.3936,0.0), 
network: :e(-0.7894,0.1759,0.0), 
network: :e(-0.5312,0.4265,0.0), 
network: :e(0.2520,0.1703,0.0), 
network: :e(-0.5788,0.2655,0.0), 
network: :e(-0.8318,0.5445,0.0), 
network: :e(-0.6986,0.3857,0.0), 
network: :e(-0.7364,0.1186,0.0), 
network: :e(-0.9350,0.1137,0.0), 
network: :e(0.4396,0.4143,0.0), 
network: :e(-0.5469,0.2496,0.0), 
network: :e(-0.0841,0.3652,0.0), 
network: :e(0.3221,0.6909,0.0), 
network: :e(0.1076,0.5795,0.0), 
network: :e(-0.7186,0.2565,0.0), 
network: :e(-0.8788,0.4506,0.0), 
network: :e(-0.6985,0.9505,0.0), 
network: :e(0.3976,0.1181,0.0), 
network: :e(-0.5045,0.5720,0.0), 
network: :e(0.2502,0.3978,0.0), 
network: :e( 0.6171,0.1019,0.0), 


network: :e(0.3183,0.0879,0.0), 
network: :e(-0.5745,0.1862,0.0), 
network:: e(0.0976,0.5518,0.0), 
network:: e(0.4845,0.3537,0.0), 
network: :e(0.5240,0.4662,0.0), 
network: :e(-0.7814,-0.0753,0.0), 
network: :e(-0.4970,0.5995,0.0), 
network: :e(-0.9698,0.4662,0.0), 
network: :e(0.4354,0.1219,0.0), 
network: :e(-0.6794,0.3075,0.0), 
network: :e(-0.6253,0.0710,0.0), 
network: :e(-0.0232,0.4044,0.0), 
network: :e(0.2320,0.7107,0.0), 
network:: e(0.0938,0.4667,0.0), 
network: :e(0.1423,0.1790,0.0), 
network: :e(-0.6169,0.2551,0.0), 
network:: e(0.2364,0.5154,0.0), 
network:: e(0.3891,0.4043,0.0), 
network: :e(-0.9518,-0.0377,0.0), 
network: :e(0.2409,0.7195,0.0), 
network: :e(0.1245,0.4518,0.0), 
network::e(-0.6057,0.2691,0.0),
network: :e(-0.7140,0.3087,0.0), 
network:: e(0.3101,0.3468,0.0), 
network: :e(0.1802,0.4620,0.0), 
network:: e(-0.4266,0.6472,0.0), 
network: :e(0.0614,0.3249,0.0), 
network:: e(0.0774,0.3218,0.0), 
network: :e(0.4281,0.1345,0.0), 
network: :e(-0.8025,0.6688,0.0), 
network: :e(0.4014,0.4252,0.0), 
network: :e(0.3708,0.2641,0.0), 
network: :e(-0.8077,0.4149,0.0), 
network: :e(0.5016,0.2393,0.0), 
network: :e(0.5824,0.2284,0.0), 
network:: e(-0.5914,0.3023,0.0), 
network: :e(-0.8704,0.2694,0.0), 
network: :e(-0.7209,0.1968,0.0), 
network: :e(0.2778,0.2179,0.0), 
network:: e(0.3324,0.2735,0.0), 
network: :e(-0.1409,0.3925,0.0), 
network: :e(-0.5976,0.1479,0.0), 
network::e(-0.8558,0.1451,0.0),
network:: e(-0.8891,0.2690,0.0), 
network: :e(0.2135,0.4361,0.0), 
network:: e( -0.5347,0.5790,0.0), 
network:: e(0.3169,0.3971,0.0), 
network: :e(-0.6812,0.0421,0.0), 
network:: e(-0.9759,0.4596,0.0), 
network: :e(0.4146,0.2714,0.0), 
network:: e(0.3275,0.3678,0.0), 
network:: e( -0.9321,0.0936,0.0), 
network: :e(0.5840,0.4715,0.0), 
network: :e(-0.4444,0.2301,0.0), 
network: :e(0.2911,0.1937,0.0), 
network: :e(-0.5108,0.4150,0.0), 
network: :e(-0.9660,0.1793,0.0), 
network: :e(0.1874,0.2975,0.0), 
network:: e(0.1797,0.4518,0.0), 
network: :e(-0.7269,0.3573,0.0), 
network: :e(-0.5434,0.41011,0.0)]), 
    training(XOR, 0.001, 0.5),
    XOR:usit([-0.5982,0.9870], XOR1), stdio::nl,
    stdio::write("xor(-0.5982,0.9870)=", XOR1), stdio::nl, stdio::nl,
    XOR:usit([-0.2019,0.6210], XOR2), stdio::nl,
    stdio::write("xor(-0.2019,0.6210)=", XOR2), stdio::nl, stdio::nl,
    XOR:usit([0.4715,0.4822], XOR3), stdio::nl,
    stdio::write("xor(0.4715,0.4822)=", XOR3), stdio::nl, stdio::nl,
    XOR:usit([-0.0982,0.5876], XOR4), stdio::nl,
    stdio::write("xor(-0.0982,0.5876)=", XOR4), stdio::nl, stdio::nl,
    XOR:usit([-0.3566,0.6371], XOR5), stdio::nl,
    stdio::write("xor(-0.3566,0.6371)=", XOR5), stdio::nl, stdio::nl,
    XOR:usit([0.6388,0.4211], XOR6), stdio::nl,
    stdio::write("xor(0.6388,0.4211)=", XOR6), stdio::nl, stdio::nl,
    XOR:usit([0.6298,0.2815], XOR7), stdio::nl,
    stdio::write("xor(0.6298,0.2815)=", XOR7), stdio::nl, stdio::nl,
    XOR:usit([-0.4622,0.6166], XOR8), stdio::nl,
    stdio::write("xor(-0.4622,0.6166)=", XOR8), stdio::nl, stdio::nl,
    XOR:usit([-0.0733,0.5582], XOR9), stdio::nl,
    stdio::write("xor(-0.0733,0.5582)=", XOR9), stdio::nl, stdio::nl,
    XOR:usit([-0.5541,0.5125], XOR10), stdio::nl,
    stdio::write("xor(-0.5541,0.5125)=", XOR10), stdio::nl, stdio::nl,
    XOR:usit([-0.4376,0.8781], XOR11), stdio::nl,
    stdio::write("xor(-0.4376,0.8781)=", XOR11), stdio::nl, stdio::nl,
    XOR:usit([-0.2224,0.8885], XOR12), stdio::nl,
    stdio::write("xor(-0.2224,0.8885)=", XOR12), stdio::nl, stdio::nl,
    XOR:usit([0.0935,0.6731], XOR13), stdio::nl,
    stdio::write("xor(0.0935,0.6731)=", XOR13), stdio::nl, stdio::nl,
    XOR:usit([0.5317,0.5437], XOR14), stdio::nl,
    stdio::write("xor(0.5317,0.5437)=", XOR14), stdio::nl, stdio::nl,
    XOR:usit([0.4021,0.5164], XOR15), stdio::nl,
    stdio::write("xor(0.4021,0.5164)=", XOR15), stdio::nl, stdio::nl,
    XOR:usit([0.4756,0.6506], XOR16), stdio::nl,
    stdio::write("xor(0.4756,0.6506)=", XOR16), stdio::nl, stdio::nl,
    XOR:usit([-0.2338,0.6364], XOR17), stdio::nl,
    stdio::write("xor(-0.2338,0.6364)=", XOR17), stdio::nl, stdio::nl,
    XOR:usit([-0.3158,0.7503], XOR18), stdio::nl,
    stdio::write("xor(-0.3158,0.7503)=", XOR18), stdio::nl, stdio::nl,
    XOR:usit([-0.4735,0.6385], XOR19), stdio::nl,
    stdio::write("xor(-0.4735,0.6385)=", XOR19), stdio::nl, stdio::nl,
    XOR:usit([0.5924,0.8926], XOR20), stdio::nl,
    stdio::write("xor(0.5924,0.8926)=", XOR20), stdio::nl, stdio::nl,
    XOR:usit([-0.2261,0.7979], XOR21), stdio::nl,
    stdio::write("xor(-0.2261,0.7979)=", XOR21), stdio::nl, stdio::nl,
    XOR:usit([-0.4400,0.5210], XOR22), stdio::nl,
    stdio::write("xor(-0.4400,0.5210)=", XOR22), stdio::nl, stdio::nl,
    XOR:usit([-0.5465,0.7458], XOR23), stdio::nl,
    stdio::write("xor(-0.5465,0.7458)=", XOR23), stdio::nl, stdio::nl,
    XOR:usit([0.4640,0.5107], XOR24), stdio::nl,
    stdio::write("xor(0.4640,0.5107)=", XOR24), stdio::nl, stdio::nl,
    XOR:usit([-0.1519,0.8122], XOR25), stdio::nl,
    stdio::write("xor(-0.1519,0.8122)=", XOR25), stdio::nl, stdio::nl.

The results obtained after training and testing the neural
network are as follows. With 25 positive examples from the
training set, we get the result shown in Figure 2 and
Table III.



Figure 2. Results of 25 positive input examples 


Table III. Results of 25 positive input examples

Input 1   Input 2   Output
-0.60     0.99      0.90
-0.20     0.62      0.91
 0.47     0.48      0.93
-0.10     0.59      0.91
-0.36     0.64      0.90
 0.64     0.42      0.94
 0.63     0.28      0.93
-0.46     0.62      0.90
-0.07     0.56      0.91
-0.55     0.51      0.89
-0.44     0.88      0.90
-0.22     0.89      0.91
 0.09     0.67      0.92
 0.53     0.54      0.93
 0.40     0.52      0.93
 0.48     0.65      0.93
-0.23     0.64      0.91
-0.32     0.75      0.91
-0.47     0.64      0.90
 0.59     0.89      0.94
-0.23     0.80      0.91
-0.44     0.52      0.90
-0.55     0.75      0.89
 0.46     0.51      0.93
-0.15     0.81      0.91

With 30 examples from the training set (15 positive
and 15 negative), we get the result shown in Figure 3 and
in Table IV.



Table IV. Results of 30 training examples

Input 1   Input 2   Output
-0.60     0.99      0.90
-0.20     0.62      0.91
 0.47     0.48      0.93
-0.10     0.59      0.93
-0.36     0.64      0.91
 0.64     0.42      0.90
 0.63     0.28      0.94
-0.46     0.62      0.93
-0.07     0.56      0.90
-0.55     0.51      0.91
-0.44     0.88      0.89
-0.22     0.89      0.90
 0.09     0.67      0.91
 0.53     0.54      0.92
 0.40     0.52      0.93
 0.05     0.16      0.93
-0.75     0.09      0.91
-0.77     0.26      0.87
 0.22     0.13      0.87
 0.37     0.50      0.92
-0.63     0.63      0.93
-0.43     0.14      0.89
-0.84    -0.19      0.89
 0.48     0.22      0.86
 0.32     0.33      0.93
 0.32     0.33      0.93
-0.89     0.41      0.93
 0.18     0.45      0.87
 0.32     0.39      0.92
 0.56     0.47      0.93


With 4 sample inputs that are not from the training set: 
Total Err: 0.0950636151519453 


Table V. Results for 4 input examples

Input 1   Input 2   Output
0         0         0.909507490047197
0         1         0.921189000491375
1         0         0.940641576665331
1         1         0.946039530567931


It can be seen that in all three cases the average
error is the same. The neural network implemented in
Visual Prolog gives better results than the NeuroPh
simulator.

To complete the study, we compare these results
with the results of a neural network implemented in the
object-oriented Java programming language.

The neural network is implemented as several
classes in several .java files. The main class has the
following program code:

public class Main {

    public static void main(String[] args) {
        System.out.println("Starting neural network sample... ");

        // Training inputs and expected outputs are read from text files.
        float[][] x = DataUtils.readInputsFromFile("data/x.txt");
        int[] t = DataUtils.readOutputsFromFile("data/t.txt");

        NeuralNetwork neuralNetwork = new NeuralNetwork(x, t, new INeuralNetworkCallback() {

            @Override
            public void success(Result result) {
                float[] valueToPredict = new float[] {-0.205f, 0.780f};
                System.out.println("Success percentage: " + result.getSuccessPercentage());
                System.out.println("Predicted result: " + result.predictValue(valueToPredict));
            }

            @Override
            public void failure(Error error) {
                System.out.println("Error: " + error.getDescription());
            }
        });

        neuralNetwork.startLearning();
    }
}

For training and testing this neural network, the
data set given in Table I is used again. The success
rate of the neural network implemented in Java is 80%.

A comparison of the mean errors of the three realized 
neural networks is given in Table VI. 


Table VI. Comparison of the average error of neural networks

Tools            Average error of neural networks
NeuroPh          0.3828746868257739
Visual Prolog    0.0950636151519453
Java             0.8


III. Conclusion 

Taking into account the results obtained, we can
assert that the best-performing neural network is the one
implemented in the Visual Prolog logic development
environment. The neural network realized in the logic
programming environment produces the smallest average
output error.

The neural network implemented in the NeuroPh neural
network simulator has a much greater average
error. An advantage here is that NeuroPh has a good,
lightweight interface as well as a good visual
representation of the network. These advantages make the
simulator very suitable for training
beginner programmers in neural networks.

The third neural network, implemented in the
object-oriented Java programming language, gives
the biggest output error. Here the neural network is
spread over several classes in several files, which also
leads to considerable memory use.

The research conducted on the three neural networks,
tested with the same data set, shows that the most
appropriate and effective result is achieved with the
Visual Prolog logic programming language.

References 

[1] Cybenko G 1989 Approximation by superpositions of a sigmoidal
function, Mathematics of Control, Signals, and Systems.
http://www.scimagojr.com/journalsearch.php?q=21100376683&tip=sid&clean=0

[2] David J Livingstone 2009 Artificial Neural Network. Methods and
Applications

[3] Eduardo Costa 2009 Visual Prolog 7.2 for Tyros.
http://wiki.visual-prolog.com/Index.php?title=Visual_Prolog_for_Tyros

[4] Heaton J T 2005 Introduction to Neural Network with Java

[5] Nachev, A., Hogan, M., Stoyanov, B., Cascade-Correlation Neural
Networks for Breast Cancer Diagnosis, Proceedings of the 2011
International Conference on Artificial Intelligence, ICAI 2011, July
18-21, 2011, Las Vegas, USA, 2 Volumes 2011, IC-AI 2011: 475-480.
http://world-comp.org/p2011/ICA4110.pdf

[6] Nachev, A., B. Stoyanov, MLP and Default ARTMAP Neural
Networks for Business Classification, Intelligent Decision Making
System, Proceedings of the 4th International ISKE2009 Conference,
World Scientific Series on Computer Engineering and Information
Science 2, 179-185.
http://eproceedings.worldscinet.com/9789814295062/9789814295062_0028.html

[7] Thomas W. de Boer 2008 A Beginners' Guide to Visual Prolog,
Groningen.
http://wiki.visual-prolog.com/index.php?title=A_Beginners_Guide_to_Visual_Prolog

[8] Zdravkova, Elitsa., Nenkov, N.V., Comparative analysis of
simulators for neural networks Joone and NeuroPh, (On-line:
www.cmnt.lv), ISSN 1407-5806, ISSN 1407-5814, Transport and
Telecommunication Institute, Riga, Latvia, 2016, 20(1) 16-20.
http://www.scimagojr.com/journalsearch.php?q=21100376683&tip=sid&clean=0

[9] Zdravkova, Elitsa., Nenkov, N.V., Research of neural network
simulators through two training data sets, (On-line: www.cmnt.lv),
ISSN 1407-5806, ISSN 1407-5814, Transport and Telecommunication
Institute, Riga, Latvia, 2016, 20(1) 12-15.

AUTHORS PROFILE 

Elitsa Zdravkova Spasova, 15.02.1988, Bulgaria 

Current position, grades: assistant in Shumen University 
University studies: PhD in Informatics 

Scientific interest: Artificial Intelligence, Neural Networks, Genetic 
Algorithms 

Publications (number or main):7 

Address: Bulgaria, Shumen, Dedeagach 12, ap 24 

Phone: 0885222711 


157 https://sites.google.com/site/ijcsis/ 

ISSN 1947-5500 



International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 


Using genetic algorithm for shortest path selection
with real time traffic flow

Mohammad Aldabbagh, College of Computer Science and Mathematics, University of Mosul. 


Abstract - With the spread of smart mobile devices and the many
applications that provide maps, numerous programs now find the
closest and fastest routes between two points on a map. Because the
accuracy and usefulness of the best path depend on traffic
conditions, such a system needs additional parameters, such as the
real traffic density and velocity on each road. In addition, because
of the restricted resources of phone devices, it is not reasonable to
compute exact optimal solutions with the familiar deterministic
algorithms that are usually used to find the shortest path on a map
with a moderate number of nodes. To resolve this issue, this paper
proposes using a genetic algorithm to reduce the computation time.
The proposed system uses the genetic algorithm to find the path with
the shortest travel time under varying real traffic conditions. The
genetic algorithm demonstrates excellent results when applied to many
types of map, especially as the number of nodes increases.


I. Introduction 

By the early 2000s, cell phones and tablets had become as widespread
as computers. They have also become advanced enough to solve their
users' daily problems, such as calculating taxes, scheduling, and
aiding them in making decisions. However, they still cannot be
compared to ordinary computers because of their size and limited
resources. Under this restriction, the execution time of any
application becomes very important. One common user problem is
finding the shortest route, where "shortest" means minimum cost, time
and effort.

The shortest path problem can be solved by many algorithms, but
because of the resource limitations of these devices, evolutionary
algorithms were introduced; they have grown into extremely effective
means of solving optimization problems.

II. BACKGROUND 
A. GENETIC ALGORITHM 

The genetic algorithm is a biologically inspired heuristic search
approach for finding exact or approximate solutions [1-9]. It has a
wide range of applications, including robotics, finance, planning and
scheduling [1] [2], pattern recognition [3], engineering design [4]
[5], etc.

The steps of the genetic algorithm can be briefly stated as follows:

• Initialize the population

• Select the fitness function

• Evaluate each individual in the population with the fitness
function

• Select the top-graded part to breed

• Breed a new generation using crossover or mutation

• Replace the worst-graded part of the population

• Repeat until a termination condition is reached.

Software Engineering Department, College of Computer Science and
Mathematics, University of Mosul, Mosul, Iraq. (E-mail:
m.a.taha@uomosul.edu.iq).

Fitness 

In the fitness computation step, every individual in the population
must be evaluated by a fitness function. The fitness function
measures the quality of the individuals generated by the genetic
algorithm.

Selection

To approach the optimal solution, the best child solutions must be
selected as parents of the new population. The selection operation
depends on the fitness values in the population.

Crossover

Crossover combines the genetic material of two or more individual
solutions. It splits two individuals at n positions and assembles the
pieces alternately into a new one [6].

Mutation

Mutation applies random changes to individuals [6]. The purpose of
this disturbance is to keep the solution away from local optima. The
mutation process can be controlled by a mutation rate chosen
according to the solution space.

B. SHORTEST PATH PROBLEM 

The shortest path problem is concerned with finding a route between
two vertices (nodes) in a graph that has the minimum sum of edge
weights. The problem can be solved simply by breadth-first search if
all edge weights equal 1, but here the weights can take any value
(depending on traffic conditions). There are many well-known
algorithms for solving the problem, such as the Bellman-Ford,
Dijkstra and Floyd-Warshall algorithms [7], but when the number of
vertices becomes too large, the running time becomes undesirable.
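
As a point of reference for the deterministic methods mentioned above, the following is a minimal Python sketch of Dijkstra's algorithm with a binary heap. It is an illustration only, not code from this work: the function name `dijkstra`, the adjacency-dictionary layout and the four-junction map `ROADS` are assumptions made for the example, and edge weights are assumed to be non-negative.

```python
import heapq

def dijkstra(adj, source, target):
    """Return (cost, path) of a cheapest route, or (inf, []) if unreachable.

    adj maps each node to a dict {neighbor: non-negative edge weight}.
    """
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    done = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue              # stale heap entry
        done.add(u)
        if u == target:
            break                 # target settled: its distance is final
        for v, w in adj.get(u, {}).items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if target not in dist:
        return float("inf"), []
    path = [target]
    while path[-1] != source:     # walk predecessors back to the source
        path.append(prev[path[-1]])
    return dist[target], path[::-1]

# Hypothetical 4-junction map; weights stand for travel times.
ROADS = {
    "A": {"B": 4.0, "C": 1.0},
    "B": {"D": 1.0},
    "C": {"B": 2.0, "D": 5.0},
    "D": {},
}
```

Calling `dijkstra(ROADS, "A", "D")` returns the exact optimum for this small map; the genetic approach discussed in this paper trades that exactness for shorter running time on large maps.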

Chang et al. [8] in 2002 suggested a genetic algorithm for solving
the shortest path problem. The problem is described as follows:




The network can be defined as a graph G = (N, A), where N is the set
of n nodes (vertices) and A is the set of edges. C = [Cij] is the
cost matrix, where Cij is the cost from node i to node j. S and D
represent the source and destination, respectively. The link
indicator Iij shows whether a route exists between node i and node j:
if there is a route, then Iij = 1; otherwise, Iij = 0.
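
The definitions above can be made concrete with a small Python sketch. Encoding a missing link as an infinite cost is an assumption made here for illustration, as are the names `link_indicator` and `path_cost` and the 4-node matrix `C`; none of them come from the original system.

```python
INF = float("inf")

def link_indicator(cost):
    """Build I from C: I[i][j] = 1 if a direct link i -> j exists, else 0."""
    n = len(cost)
    return [[1 if i != j and cost[i][j] != INF else 0 for j in range(n)]
            for i in range(n)]

def path_cost(cost, indicator, path):
    """Sum C[i][j] over consecutive nodes of path; INF if a link is missing."""
    total = 0.0
    for i, j in zip(path, path[1:]):
        if not indicator[i][j]:
            return INF
        total += cost[i][j]
    return total

# Hypothetical cost matrix for 4 nodes (INF marks "no direct link").
C = [
    [INF, 2.0, 5.0, INF],
    [2.0, INF, 1.0, 4.0],
    [5.0, 1.0, INF, 1.0],
    [INF, 4.0, 1.0, INF],
]
I = link_indicator(C)
```

With S = 0 and D = 3, the route 0, 1, 2, 3 is valid under I, while the direct hop from 0 to 3 is rejected because I[0][3] = 0.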

III. Problem description and solution 

The experiments use the SUMO simulator [9] to build a road map grid
consisting of nodes and links, which represent junctions and streets,
respectively (Figure 1 shows a sample 5x5 node grid with its
junctions, designed in SUMO and used in one of the experiments). Each
junction is controlled by a TLS (traffic light system), so we have a
grid of traffic lights that causes some random obstruction of traffic
flow in all streets. All streets (links) are equipped with induction
loop detectors (E1) to collect the traffic flow data of the street.
All the weights in the weight matrix are updated whenever the vehicle
crosses a junction on the chosen path, which affects the fitness
function and the decision taken for the next node.


Figure 1: 5x5 node grid and junctions

The route from the source node to the destination node is represented
as a chromosome of variable length. Figure 2 shows the chromosome
structure: each gene (except gene 0, which is reserved for
programming needs) represents a node between S and D, so all nodes
whose entry in the link indicator matrix I is 0 are excluded from the
chromosome.

Figure 2: Chromosome structure

The Fitness function

The fitness function measures the quality of a chromosome; it is
therefore very important for the next generation. Ahn and Ramakrishna
[8] described the fitness formula as follows:

$$f(i) = \frac{1}{\sum_{j=1}^{L-1} \dfrac{D(g(i,j),\, g(i,j+1))}{V(g(i,j),\, g(i,j+1)) \cdot \bigl(1 - Y(g(i,j),\, g(i,j+1))\bigr)}}$$

where

f(i): fitness of the i-th chromosome;
g(i,j): j-th gene of the i-th chromosome;
L: chromosome length;
D: distance between two nodes;
V: velocity limit between two nodes;
Y: density between two nodes.

In all experiments the distance and velocity are constant; only the
density parameter varies, and it is updated in real time from the
SUMO detectors.
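
The fitness computation can be sketched in Python as follows. The sketch assumes that the fitness is the reciprocal of the summed link costs D / (V * (1 - Y)) along the route, that Y is a density fraction in [0, 1), and that the chromosome is given as a plain list of node indices without the reserved gene 0. The function name and the handling of a fully congested link are likewise assumptions for illustration.

```python
def fitness(chromosome, D, V, Y):
    """Return 1 / sum of D / (V * (1 - Y)) over consecutive genes.

    D, V and Y are square matrices (nested lists) indexed by node number.
    """
    total = 0.0
    for a, b in zip(chromosome, chromosome[1:]):
        free_share = 1.0 - Y[a][b]
        if free_share <= 0.0:
            return 0.0  # assumption: a fully congested link makes the route unusable
        total += D[a][b] / (V[a][b] * free_share)
    return 1.0 / total if total > 0.0 else 0.0

# Hypothetical 3-node example: equal distances and speed limits,
# only the density differs between links.
DIST = [[100.0] * 3 for _ in range(3)]
VEL = [[10.0] * 3 for _ in range(3)]
DENS = [
    [0.0, 0.5, 0.75],
    [0.5, 0.0, 0.0],
    [0.75, 0.0, 0.0],
]
```

In this example the two-hop route [0, 1, 2] scores higher than the direct but congested route [0, 2], which is the behaviour a density-aware fitness is meant to produce.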


Crossover 

In crossover, the algorithm searches for genes with the same value on
the two chromosomes to identify the possible split points, and then
selects one of these points at random. All the genes after the
selected point are exchanged between chromosome 1 and chromosome 2.

The crossover points usually lie at different positions in each
parent chromosome, so crossover may create a loop, in which some node
is repeated as a gene of a chromosome. Such chromosomes are discarded
without affecting the crossover rate.
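
A minimal Python sketch of this crossover is given below. It assumes that both parents are loop-free routes sharing the same source and destination, and that a rejected pair is signalled by returning `None`; the function name and that convention are illustrative choices, not taken from the original implementation.

```python
import random

def crossover(p1, p2, rng=random):
    """Swap the tails of two routes at a randomly chosen common node.

    Only intermediate nodes are considered as split points. Returns the
    two children, or None when the parents share no intermediate node
    or when a child would visit some node twice (a loop).
    """
    common = [(i, p2.index(n)) for i, n in enumerate(p1[1:-1], start=1)
              if n in p2[1:-1]]
    if not common:
        return None
    i, j = rng.choice(common)          # positions of the shared node in p1 and p2
    c1 = p1[:i + 1] + p2[j + 1:]
    c2 = p2[:j + 1] + p1[i + 1:]
    if len(set(c1)) != len(c1) or len(set(c2)) != len(c2):
        return None                    # loop detected: discard the offspring
    return c1, c2
```

For example, the parents [0, 1, 2, 5] and [0, 3, 2, 4, 5] share only node 2, so the children are [0, 1, 2, 4, 5] and [0, 3, 2, 5].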

Mutation 

The mutation operation ensures the genetic variety of the population
and keeps the solution away from local optima. To apply mutation, in
each cycle of the loop over the chromosomes a random number between 0
and 1 is generated and compared to the mutation rate. If it is less
than the mutation rate, a mutation point (node) is selected at random
in the current chromosome. All genes before the selected point are
kept fixed, and all genes after the selected point up to D are
regenerated at random by the initialization function that is used to
create the random population.
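
The mutation step can be sketched in Python as follows. The random-walk initializer `random_route`, the helper `grid_adjacency` and the retry limit `max_tries` are assumptions introduced for this example; the original initialization function is not shown in the paper, so this is one plausible stand-in rather than the actual method.

```python
import random

def grid_adjacency(n):
    """Neighbor lists for an n x n street grid, nodes numbered row by row."""
    adj = {}
    for r in range(n):
        for c in range(n):
            steps = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            adj[r * n + c] = [rr * n + cc for rr, cc in steps
                              if 0 <= rr < n and 0 <= cc < n]
    return adj

def random_route(adj, start, goal, visited=(), rng=random, max_tries=100):
    """Random loop-free walk from start to goal; None if every try dead-ends."""
    for _ in range(max_tries):
        route, seen = [start], set(visited) | {start}
        while route[-1] != goal:
            options = [n for n in adj[route[-1]] if n not in seen]
            if not options:
                break                      # dead end: retry from scratch
            nxt = rng.choice(options)
            route.append(nxt)
            seen.add(nxt)
        if route[-1] == goal:
            return route
    return None

def mutate(route, adj, rate, rng=random):
    """With probability `rate`, keep a random prefix and regrow the tail."""
    if rng.random() >= rate or len(route) < 3:
        return route
    k = rng.randrange(1, len(route) - 1)   # index of the mutation node
    tail = random_route(adj, route[k], route[-1], visited=route[:k], rng=rng)
    return route[:k] + tail if tail else route
```

Passing the already fixed prefix as `visited` keeps the regrown tail from re-entering it, so a mutated route stays loop-free by construction.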

Algorithm steps 

The algorithm for the working system is as follows: 

Begin

S1: Initialize (population size, mutation rate, crossover rate).
S2: Read C values.
S3: Generate the initial population.
S4: Compute fitness for all chromosomes.
S5: Count = 0, G = 1.
S6: loop
S7: Random selection.
S8: Crossover (chromosomes).
S9: Mutation (chromosomes).
S10: Compute fitness for all new chromosomes.
S11: If (minf [Generation_now] = minf [Generation_prev])
Counter++; Else stop loop.
S12: Count = 0;
S13: G++; Jump to S7.

End
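
The loop structure can be sketched in Python as below. The termination rule in S11 and S12 is ambiguous as printed, so the sketch adopts one plausible reading: a counter of consecutive generations without improvement, with the loop ending after `patience` such generations. That reading, the elitist survivor selection, the `patience` and `max_generations` parameters and the toy bit-string problem in `demo` are all assumptions made for illustration.

```python
import random

def run_ga(init_pop, fitness, crossover, mutate, patience=5,
           max_generations=200, rng=random):
    """Evolve until the best fitness stalls for `patience` generations."""
    population = list(init_pop)
    best = max(population, key=fitness)
    count, generation = 0, 1                           # S5
    while count < patience and generation < max_generations:
        rng.shuffle(population)                        # S7: random pairing
        children = []
        for a, b in zip(population[::2], population[1::2]):
            pair = crossover(a, b) or (a, b)           # S8: keep parents if rejected
            children.extend(mutate(c) for c in pair)   # S9
        # S10: rank old and new chromosomes, keep the fittest
        population = sorted(population + children, key=fitness,
                            reverse=True)[:len(init_pop)]
        if fitness(population[0]) > fitness(best):     # S11-S12 (assumed reading)
            best, count = population[0], 0
        else:
            count += 1
        generation += 1                                # S13
    return best, generation

def demo(seed=0):
    """Toy run: maximize the number of 1 bits in a 12-bit chromosome."""
    rng = random.Random(seed)
    def ones(bits):
        return sum(bits)
    def cross(a, b):
        k = rng.randrange(1, len(a))
        return a[:k] + b[k:], b[:k] + a[k:]
    def flip(bits, rate=0.02):
        return tuple(b ^ (rng.random() < rate) for b in bits)
    start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(10)]
    best, generations = run_ga(start, ones, cross, flip, rng=rng)
    return start, best, generations
```

For the routing problem, `fitness`, `crossover` and `mutate` would be replaced by route-specific operators; the loop itself does not depend on the chromosome encoding.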


IV. The result 

The system was developed in the C# programming language and connected
to SUMO using TraCI [10]. Each experiment was repeated 50 times on
grids of different sizes (5x5, 10x10, 15x15, 25x25, 50x50); in all of
them the distance and velocity between nodes are constant.

Tables 1 to 5 show the number of generations needed in each
experiment for different crossover and mutation rates. According to
the results, the two parameters (crossover rate and mutation rate)
control the time and the number of generations required to reach the
goal, because increasing the mutation or crossover rate increases the
execution time required to generate the new chromosomes and evaluate
them with the fitness function. For most experiments, the best
execution times were recorded with a crossover rate of 20-30% and a
mutation rate of 1-2%.


Grid: 5x5. Entries are the number of generations.

Mutation    Crossover rate %
rate %      10    20    30    40    50
1            7     6     6     6     5
2            6     6     6     5     5
3            5     5     4     4     3
4            5     4     4     3     2
5            3     3     2     2     2

Table 1: 5x5 grid results


Grid: 10x10. Entries are the number of generations.

Mutation    Crossover rate %
rate %      10    20    30    40    50
1           12    12    11    11    11
2           11    11    10    10     8
3            9     9     7     7     6
4            8     7     6     5     5
5            4     4     3     3     3

Table 2: 10x10 grid results


Grid: 15x15. Entries are the number of generations.

Mutation    Crossover rate %
rate %      10    20    30    40    50
1           29    27    27    25    23
2           27    27    24    22    20
3           27    26    24    21    20
4           25    25    23    21    19
5           24    23    21    19    19

Table 3: 15x15 grid results


Grid: 25x25. Entries are the number of generations.

Mutation    Crossover rate %
rate %      10    20    30    40    50
1           32    31    31    30    28
2           31    30    29    28    27
3           25    25    23    22    22
4           24    24    23    22    21
5           22    21    20    18    16

Table 4: 25x25 grid results


Grid: 50x50. Entries are the number of generations.

Mutation    Crossover rate %
rate %      10    20    30    40    50
1           40    39    39    38    37
2           39    38    37    36    36
3           37    36    35    33    32
4           33    32    31    30    28
5           30    29    26    25    24

Table 5: 50x50 grid results
Abstract - Rapid changes in technology have led to an increased variety of data sources. These varied sources
generate data in large volumes and at extremely high speed. Accommodating and using this data in decision-making
systems is a big challenge. To make the fullest use of the valuable data generated by different systems, the target
user base of analysis systems needs to grow. In general, the knowledge discovery process with the available tools
requires considerable expertise in the domain as well as in the technology. The ITDA (Integrated Tool for Data
Analysis) project aims to provide a complete platform for multidimensional data analysis to enhance the
decision-making process in every domain. The project provides all the techniques required to perform
multidimensional data analysis and avoids the overheads incurred by the traditional cube architecture followed by
most analytics systems. Modelling the available data in multidimensional form is the basic and crucial step
of multidimensional analysis. This work describes the multidimensional modelling aspect and its implementation
in the ITDA project.

Keywords - Multidimensional data analysis, cube, data mining, machine learning, ETL, multidimensional modelling, 
OLAP. 


I. Introduction 

Due to the increased frequency of data generation, the amount of data to be analyzed is also increasing
tremendously. The large size of the data and the complexity of data analysis demand an easy platform so that
researchers and domain experts can analyze their data without hard-core knowledge of information
technology. Ad hoc querying or ad hoc reporting is the main need of data analysis. To achieve this, data
modeling is an essential task if the system is to serve a variety of domains. Multidimensional data
modeling is the way to provide the facility to perform ad hoc analysis. Analyzing multidimensional data is a
growing need in order to extract knowledge and hence enable decision making in various domains. A data
analysis process that leads to enhanced decision making combines various techniques, such as statistical
techniques, data mining algorithms and machine learning techniques. Along with all these techniques, presenting
the analysis output with attractive visuals is a key part of popular analytics systems. Most current
multidimensional systems rely on data cubes, which are very resource and time intensive. In this context,
the ITDA architecture is a solution for multidimensional analysis with reduced memory and time overheads
compared to existing systems.

Absorbing a high volume of data from a variety of sources requires a robust and flexible system. In OLAP
terminology, the data modelling and data absorption system is called the Extraction-Transformation-Loading
(ETL) process. The most important by-product of the ETL process is the metadata. The ITDA system uses an
on-the-fly architecture for query generation, and hence the metadata of the multidimensional model is a very
crucial component of the system. In a typical analysis environment, ETL processes are performed in an ad hoc,
in-house fashion or by using specialized ETL tools. The general functionalities of all these tools are:
identification of the relevant information present at the source; extraction of this information; customization
and integration of the information coming from multiple sources into a common format; cleansing of the final
data set on the basis of database and business rules; and propagation of the data to the relational database
that will be used for analysis. In the current scenario, organizations may have a number of sources contributing
to data collection and playing an important role in the modelling process. The source data may reside in
different places, and all the data that is necessary and relevant for the analysis must be extracted. After the
transformations are applied according to the business rules, the data is transferred into the target model.

This paper focuses on this important aspect of any decision making tool, i.e. modeling data that may reside at
varied locations and in heterogeneous formats into an analysis-ready form. The paper is organized as follows.
Section II discusses related work in this area. Section III gives a brief introduction to the architecture of the
ITDA project along with its basic characteristics. Section IV discusses the conceptual design of the ETL process
for ITDA. Section V discusses the implementation of the process through a case study in which the data is
available in transformed format. Finally we summarize the contents and discuss the future scope of the system.

II. Related Work 

A multidimensional data analysis system that enhances the efficiency and accuracy of decision support is a
growing need today. Many big technology players, such as IBM and Microsoft, have a good range of solutions for
this. Every solution has its own pros and cons. As discussed in [1], most multidimensional analysis tools have a
steep learning curve. Many tools are domain specific. The tools that offer a good range of analytical options
generally provide a different component for each facility, which demotivates non-expert data analysts.

Microstrategy is a leading name in the data analysis market. Microstrategy provides a component called
Integrity Manager which takes care of the ETL process. It replaces the traditional manual process of data
integration. The ETL process is a separate component in this tool. A number of supporting ETL components are
available in Microstrategy, such as Enterprise Manager ETL, ETL Server and ETL Support. This may make the tool
somewhat complicated and costly for a research community that focuses more on analytics and less on
technology [3].

IBM Cognos is also a very powerful tool available in the market for multidimensional data analysis. IBM
Cognos has a different component for each feature, such as Cognos for analytics, for business intelligence, for
predictive analysis, etc. Cognos Analytics has a separate data modelling component. This component provides
the interface for data extraction from various sources, for transformations and for data validations [6].

The ETL process of ITDA is an integral part of the system, which spares the user additional installation and
usage overhead.

III. ITDA System Architecture

The ITDA system is designed to provide researchers and data analysts with a complete package of
multidimensional reporting, statistical processing, data mining, machine learning and visualization. This is
achieved by a web based system with a user friendly and secure environment for the data analyst. The system is
functionally independent; it does not require any additional external component or system to complete the task.
The components of the system are also integrated, so there is no need to install any of them separately, as is
common for most analytics tools.

The ITDA system architecture is divided into two main parts, each containing various components: a data
modeling part and a data analysis part. The first covers data absorption from different data sources, collection
of metadata, and formation of the multidimensional model. The second covers multidimensional analysis on the
modeled data, which further extends to statistical analysis and data mining.

The data modeling functionality mainly consists of the extraction, transformation and loading (ETL) process.
Source data is given to the ETL process, which produces ready-to-analyze data. The ETL process is responsible
for extracting data that resides on various sources and in a variety of formats. It also performs cleansing and
customization of data according to the analysis needs. This process is also responsible for generating the
metadata of the ready-to-analyze data. The proposed system does not store the data and the aggregations; hence
metadata has a crucial role in this system. Aggregations can be generated on the fly by using the metadata.
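
The on-the-fly idea can be illustrated with a minimal Python sketch. The metadata layout, the table name
`sales` and the function `build_aggregate_query` are hypothetical and are not taken from ITDA; the sketch only
shows how an aggregate query might be derived from stored metadata at request time instead of being read from a
pre-computed cube.

```python
import sqlite3

# Hypothetical metadata for one environment (all names are illustrative).
METADATA = {
    "table": "sales",
    "dimensions": ["region", "year"],
    "measures": ["amount"],
}

def build_aggregate_query(meta, group_by, measure, func="SUM"):
    """Build a GROUP BY query from metadata; nothing is pre-aggregated."""
    if not group_by:
        raise ValueError("at least one dimension is required")
    if func not in {"SUM", "AVG", "MIN", "MAX", "COUNT"}:
        raise ValueError(f"unsupported aggregate: {func}")
    if measure not in meta["measures"]:
        raise ValueError(f"unknown measure: {measure}")
    for dim in group_by:
        if dim not in meta["dimensions"]:
            raise ValueError(f"unknown dimension: {dim}")
    cols = ", ".join(group_by)
    return (f"SELECT {cols}, {func}({measure}) FROM {meta['table']} "
            f"GROUP BY {cols} ORDER BY {cols}")

# Small in-memory table standing in for the loaded environment data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 2017, 10.0), ("east", 2017, 5.0), ("west", 2017, 7.0)],
)
query = build_aggregate_query(METADATA, ["region"], "amount")
rows = conn.execute(query).fetchall()
```

Because names are validated against the metadata before they reach the SQL text, the metadata acts as the single
description of what may be aggregated.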

A. ITDA Characteristics 
Customized modeling of the data 

Multidimensional modelling of the data according to business needs is the key to any efficient decision
making system. ITDA is a multiuser system. Each user can model the data in his or her own way according to
business need. In ITDA terminology the information of the model is conceptualized as the ‘environment’.
A single user can have multiple environments for the same data, so the user gets various views of the data for
analysis without the complexity of handling a number of users for separate business needs.

Data absorption options 

The ITDA system can accommodate pre-processed data in flat files, where transformation is not required.
In such cases it loads the data directly into the server and collects metadata for that environment. If the data
is spread over multiple sites, the system performs ETL processing during environment creation.

Flexibility in data selection 

A data analyst can analyze a particular portion of the data by using the horizontal partitioning facility of
the system. It allows the user to analyze a particular subset of the dataset and improves performance by
reducing the number of rows read by analytical queries or algorithms. The user can obtain a particular portion of
the uploaded data directly with the row filter utility, which builds a row filter query without requiring prior
knowledge of SQL. Both facilities are integrated with the system and can be used after an environment is
created.
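
A row filter of this kind might be assembled as in the following Python sketch. The function
`build_row_filter`, the operator set and the column names are assumptions made for illustration; the source does
not describe how ITDA builds the query internally.

```python
# Operators a form-based filter might offer (hypothetical set).
ALLOWED_OPS = {"=", "!=", "<", "<=", ">", ">="}

def build_row_filter(table, columns, conditions):
    """Turn (column, operator, value) triples into a parameterized SELECT.

    Column names and operators are checked against whitelists; values are
    passed as bound parameters and never spliced into the SQL text.
    """
    clauses, params = [], []
    for column, op, value in conditions:
        if column not in columns:
            raise ValueError(f"unknown column: {column}")
        if op not in ALLOWED_OPS:
            raise ValueError(f"unsupported operator: {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

sql, params = build_row_filter(
    "sales",
    {"region", "year", "amount"},
    [("region", "=", "east"), ("year", ">=", 2016)],
)
```

The user only picks a column, an operator and a value; the SQL text is produced by the tool.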

IV. ITDA ETL Process: Conceptual Design

The ETL process starts with an understanding of the business requirements and the objective of the
organization, followed by modelling and design of the environment for that organization. Modelling and design
are defined as the representation of key business measurements around their dimensions using dimensional
modelling. This process decides the level of complexity of the transformation based on the source of the data.
If the data is present at multiple sites, ITDA provides a technique that takes care of extraction of data from
multiple sources, transformation and loading.

The last stage of conceptual design is metadata generation. Metadata contents need to be formulated for a
specific multidimensional model. The process decides the relationships described by the dimensions, such as
hierarchical or sequential relationships. It also gives the level of relationship that exists in each component
of the dimensional structure. At the end of the process ITDA produces a flat file containing the complete
metadata for the multidimensional model created by the user. It also stores the information of the temporal
component, which is used to create run-time summaries.

A. ETL Algorithm 

During the ETL process in ITDA, every correct or missed step is recorded and made available to the user.

1) Finalize the ETL processing path 

2) Finalize the type of data source 

3) For each data source, map the data source attributes to the dimensional attributes

4) Prepare the metadata

5) Prepare the configuration file for further processing of the model

One of the basic motives behind the ITDA project is to provide a multidimensional analysis platform for the
non-expert data analyst community as well as for expert data scientists. The project therefore focuses on an
interactive, user friendly implementation of the ETL process.

The ETL processing path depends on whether the data sources are at the same site or located at different
sites. If the data sources are at different locations, the user needs to create one configuration file, and data
is absorbed based on the instructions given in that file. If the data source is at a single location, the next
step is to decide the type of data source, such as a flat file or a database. The data source attributes are then
mapped to dimension and fact values, and the metadata is generated.
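
The path decision described above can be sketched in a few lines of Python. The function `plan_etl` and the
step labels are hypothetical; they only restate the branching in the text and are not ITDA code.

```python
def plan_etl(sources):
    """Return the ordered steps for a list of (site, kind) data sources."""
    if not sources:
        raise ValueError("at least one data source is required")
    sites = {site for site, _kind in sources}
    if len(sites) > 1:
        # Multiple sites: a user-supplied configuration file drives absorption.
        steps = ["read configuration file", "extract from each site",
                 "transform", "load"]
    else:
        # Single site: only the source type (flat file or database) matters.
        _site, kind = sources[0]
        steps = [f"load from {kind}"]
    # Both paths end with attribute mapping and metadata generation.
    return steps + ["map source attributes to dimensions and facts",
                    "generate metadata"]

single = plan_etl([("siteA", "flat file")])
multi = plan_etl([("siteA", "database"), ("siteB", "flat file")])
```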

V. Use of ETL Service: Case Study for Flat File and Database

ITDA implements the ETL process with a highly interactive and user friendly interface. It covers the complete
ETL process without requiring any programming. Fig. 1 shows the main interface of the ITDA system, which allows
the user to initiate the creation of a new environment in the system.

Fig. 1 ITDA user interface - option to create new environment

Fig. 2 ITDA interface - selection of ETL processing path


Fig. 2 shows the interface, which provides two different paths for the ETL process. If the data is already
available in a form transformed according to business rules, the user chooses the ‘Simple Upload’ option. If the
data needs to be extracted from various sources and pre-processed according to business rules, the
‘Steps Upload’ option is the choice.

A. Simple upload 

This module assumes that the data is already in the required form and that no transformation step is needed.
For single source data, the data can be in flat files or on a database server.

B. Flat files 

Spreadsheets or text file formats are generally used to export data from a database server. If the source
machine is not accessible from a remote location, the user can export data to flat files and use those files to
create a new environment (multidimensional model) in this web tool. ITDA accepts data in a standard comma
separated file or in any other flat file with any type of separator. The user can view sample data. The system
generates a standard query so that the user can drop unwanted columns. Successful creation of the table enables
the metadata collection interface. Figure 3 shows the user interface for uploading CSV files to the server.
Figure 4 shows the interface with sample data from the selected file and the standard query generated by the
system to extract the file. The analyst can customize the query further.
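
The flat-file step can be sketched in Python as follows. The separator guess (most frequent candidate in the
header line), the function names and the table name `uploaded_data` are assumptions made for illustration and
are not ITDA's actual logic.

```python
import csv

CANDIDATE_SEPARATORS = [",", ";", "\t", "|"]

def detect_separator(header_line):
    """Pick the candidate separator that occurs most often in the header."""
    return max(CANDIDATE_SEPARATORS, key=header_line.count)

def preview_and_query(text, table="uploaded_data", drop=(), sample_rows=3):
    """Return the separator, a few sample rows and a column-selecting query."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty file")
    sep = detect_separator(lines[0])
    reader = csv.reader(lines, delimiter=sep)
    header = next(reader)
    # Take at most `sample_rows` data rows for the preview.
    sample = [row for _, row in zip(range(sample_rows), reader)]
    # Columns listed in `drop` are left out of the generated query.
    kept = [col for col in header if col not in drop]
    query = f"SELECT {', '.join(kept)} FROM {table}"
    return sep, sample, query

text = "id;region;amount\n1;east;10\n2;west;7\n"
sep, sample, query = preview_and_query(text, drop={"id"})
```

The generated query is only a starting point; in the workflow described above the analyst can still edit it.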

Fig. 3 ITDA interface - selecting file as the data source

Fig. 4 ITDA interface - sample data and editable extraction query


C. Database 

When the data source is a database, connection details can be provided so that the system can access the
data. In this module, if the source connection and the destination connection are the same, the data migration
step is skipped. This avoids the overhead of unnecessarily copying an entire table. Because ITDA uses the
on-the-fly architecture, the source table can be used for analysis directly: the analysis reads the existing
data on the OLTP server. With the on-the-fly architecture the same server can be used for OLTP and OLAP
processing, which is the biggest advantage of this architecture. Figure 5 shows the database option available in
the simple upload module.
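
The migration check can be sketched as below. The `Connection` fields, the function `plan_load` and the
`_copy` suffix are hypothetical; the source only states that migration is skipped when both connections are the
same.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Connection:
    """Minimal connection description (fields are illustrative)."""
    host: str
    port: int
    database: str

def plan_load(source, destination, table):
    """Decide whether a table must be copied before analysis."""
    if source == destination:
        # Same server and database: analyze the existing table in place.
        return {"migrate": False, "analysis_table": table}
    # Different servers: copy the table to the analysis server first.
    return {"migrate": True, "analysis_table": f"{table}_copy"}

oltp = Connection("db.example.org", 5432, "shop")
olap = Connection("analysis.example.org", 5432, "warehouse")
in_place = plan_load(oltp, oltp, "orders")
copied = plan_load(oltp, olap, "orders")
```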



Fig. 5 ITDA interface - options to map source database for data extraction

D. Steps Upload

If data has to be taken from multiple sources, extraction and transformation logic is needed at the server
side. In simple upload the data arrives pre-processed, so it is much easier to load it into the server and collect
metadata. To support collection of data from multiple sites, ITDA has the steps upload module. This module takes
care of extraction, transformation and loading of data at the server. Here the user uploads to the server a
configuration file containing all details and transformation scripts. All configuration file parameters are
obtained at the time of conceptual design. The configuration file is in simple text format so that any database
user can build it; this keeps the process of environment creation as easy as possible. Figure 6 shows the steps
upload choice.




Fig. 6 ITDA interface - option when data needs transformation steps 
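
The source does not give the layout of the configuration file. Purely as an illustration, the following Python
sketch assumes an INI-style text file with one `source:` section per data source and a `transform` section; all
section names, keys and values are invented for the example.

```python
import configparser

# Hypothetical configuration text; the real ITDA format is not documented here.
SAMPLE_CONFIG = """
[source:orders]
type = database
host = db1.example.org
table = orders

[source:customers]
type = flatfile
path = customers.csv

[transform]
script = clean_and_join.sql
"""

def read_steps_config(text):
    """Parse the sources and the transformation script from the config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sources = {
        name.split(":", 1)[1]: dict(parser[name])
        for name in parser.sections()
        if name.startswith("source:")
    }
    script = parser.get("transform", "script", fallback=None)
    return sources, script

sources, script = read_steps_config(SAMPLE_CONFIG)
```

A plain key-value layout of this kind can be written in any text editor, which matches the stated goal that any
database user can build the file.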


E. Metadata Collection 

A multiuser system has to maintain the context of every user separately. The ITDA ETL process uses a specific
directory structure to maintain all the environments created by a user. For every environment there is one flat
file storing the customized operations built by the user for performing OLAP. This file is kept separately for
each environment to avoid clashes. A separate directory structure, holding all the information necessary to
operate the models created by a user, is provided to every user. When a user registers with the system for the
first time, this complete directory structure is created for that user.
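
A per-user layout of this kind might look as follows in Python. The directory and file names
(`environments`, `metadata.txt`, `operations.txt`) are assumptions; the source states only that each user has a
directory structure and each environment has its own flat files.

```python
import tempfile
from pathlib import Path

def register_user(root, user):
    """Create the directory tree for a newly registered user."""
    user_dir = Path(root) / user
    (user_dir / "environments").mkdir(parents=True, exist_ok=True)
    return user_dir

def create_environment(user_dir, env_name):
    """Create one environment with its own metadata and operations files."""
    env_dir = user_dir / "environments" / env_name
    env_dir.mkdir(parents=True, exist_ok=False)  # refuse to overwrite
    (env_dir / "metadata.txt").touch()
    (env_dir / "operations.txt").touch()
    return env_dir

# Use a temporary directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as root:
    alice = register_user(root, "alice")
    env = create_environment(alice, "sales_by_region")
    created = sorted(p.name for p in env.iterdir())
```

Keeping one operations file per environment directory means two environments of the same user never write to
the same file.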

To supply metadata, the user fills in a simple HTML form giving the dimension names, their hierarchies and the
time dimension details. Once this data is entered, the system can proceed with environment creation.

F. ETL for periodic updates 

For any ETL system, updating data in the server is a crucial part, since OLTP servers generate new data
continuously. To allow analysis on updated data, the system either replaces the data available in the server or
adds new data while keeping the earlier data as it is. The important point is that the environment metadata does
not change, so the metadata collection process can be skipped and the system can directly update the data in the
required environment.




https://sites.google.com/site/ijcsis/ 
ISSN 1947-5500 













International Journal of Computer Science and Information Security (IJCSIS), 
Vol. 16, No. 1, January 2018 



Fig. 7 ITDA interface - options for edit environment 

Each user can create any number of environments based on the analysis need. The update process is invoked
separately for each environment. Figure 7 shows the environment selection interface and the various operations
that the user can perform after selecting an environment.

This module loads new data into the same environment. Figure 8 shows the result after uploading a new dataset
file to the server. The module flushes the older data from the table and inserts the new data.
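
The two update strategies mentioned in this section (replace the existing data, or add new data while keeping
the old) can be sketched with an in-memory SQLite table. The function `update_environment`, its `mode` parameter
and the two-column table are hypothetical and stand in for whatever storage ITDA uses.

```python
import sqlite3

def update_environment(conn, table, rows, mode="replace"):
    """Load new rows into an existing environment table.

    mode="replace" flushes the old rows first; mode="append" keeps them.
    The metadata is untouched in both cases. `table` is assumed to be a
    trusted name chosen by the system, not user input.
    """
    if mode not in {"replace", "append"}:
        raise ValueError(f"unknown mode: {mode}")
    with conn:  # one transaction: the whole update applies or none of it
        if mode == "replace":
            conn.execute(f"DELETE FROM {table}")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
first = update_environment(conn, "sales", [("east", 10.0), ("west", 7.0)])
appended = update_environment(conn, "sales", [("north", 3.0)], mode="append")
replaced = update_environment(conn, "sales", [("south", 1.0)], mode="replace")
```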



Fig. 8 ITDA interface - edit environment option 

VI. Conclusion and Future Work

The design of the ETL process in ITDA addresses the requirements of efficient extraction, transformation and
loading of data from various sources. It also meets the challenges of assimilating data from heterogeneous data
sources and provides an easy to use tool for uploading an existing data set. It successfully collects all
metadata parameters required for multidimensional analysis. The designed ETL model can be extended with
automatic multidimensional modelling, in which metadata is extracted automatically at load time. It can also be
extended with context based data generation, which collects as well as models data gathered from the web. This
data can in turn be channelled to multidimensional analysis.

Abstract - In this paper we present a modified Generalized
Regression Neural Network that correctly recognizes all
breast cancer cases in the Wisconsin Diagnostic Breast Cancer
and Wisconsin Prognostic Breast Cancer data sets. In this
method the modified neural network is trained with 50% of the
data and tested on the other 50%, and it classifies the test
data with 100% accuracy. The training and test halves are
chosen randomly.


\Delta W_{h,p} = -\eta \sum_{q=1} \bigl( -2\,\alpha\,(T_q - \varphi_{q,k})\,\varphi_{q,k} \cdots \bigr)\, \varphi_{p,j} \cdots \qquad (1)
This method is based on the fact that calculations in
floating-point numbers lose accuracy. By reducing the number
of calculations, the accuracy of the result increases
significantly.

Keywords - neural network, generalized regression neural
network (GRNN), absolute distance.


W_{p,q} denotes the weights between the second and third
layers, and W_{h,p} the weights between the first and second
layers.

The structure of the GRNN is shown in Figure 1. The learning
algorithm is shown in (2).


I. Introduction

Pattern classification problems are an important application
area of neural networks used as learning systems [1],[2],[3].
Multilayer perceptrons (MLP), radial basis function (RBF)
networks, probabilistic neural networks (PNN), self-organizing
maps (SOM), cellular neural networks (CNN), recurrent neural
networks and the conic section function neural network (CSFNN)
are some of these neural networks. In addition to
classification problems, function approximation problems are
also solved with neural networks. The generalized regression
neural network (GRNN) is one of the most popular neural
networks used for function approximation. GRNN and PNN are
kinds of radial basis function neural networks (RBF-NN) with
one-pass learning [1]. Although they are similar, PNN is used
for classification whereas GRNN is used for continuous
function approximation [4]. In this paper, however, we use
GRNN for recognition.


Y(x) = \frac{\sum_{k} Y_k \, K(x, x_k)}{\sum_{k} K(x, x_k)} \qquad (2)


The advantage of GRNN is that it learns in a single pass. The
disadvantage is that it must store all the training data,
which in some cases requires a large amount of memory.
K(x, x_k) is the radial basis function, given in (3). Y_k is
the prediction value for x_k, and Y(x) is the prediction value
for x.

K(x, x_k) = \exp\left(-\frac{d_k}{2\sigma^2}\right), \qquad d_k = (x - x_k)^T (x - x_k) \qquad (3)

where d_k is the squared Euclidean distance between the
training sample x_k and the input x. For large data sets the
error increases because of the calculation of the Euclidean
distance.
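
Equations (2) and (3) can be written as a short, dependency-free Python sketch. The function names, the
default `sigma` and the toy data are assumptions made for illustration; this is the standard GRNN estimate, not
the authors' code.

```python
import math

def squared_euclidean(x, xk):
    """d_k = (x - x_k)^T (x - x_k), as in (3)."""
    return sum((a - b) ** 2 for a, b in zip(x, xk))

def grnn_predict(x, train_x, train_y, sigma=1.0, distance=squared_euclidean):
    """GRNN output Y(x) from (2): kernel-weighted average of the targets."""
    # Kernel weight of every stored training sample, K(x, x_k) from (3).
    weights = [math.exp(-distance(x, xk) / (2.0 * sigma ** 2))
               for xk in train_x]
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("all kernel weights underflowed; increase sigma")
    return sum(w * y for w, y in zip(weights, train_y)) / total

train_x = [[0.0], [1.0]]
train_y = [0.0, 1.0]
midpoint = grnn_predict([0.5], train_x, train_y, sigma=1.0)
near_zero = grnn_predict([0.0], train_x, train_y, sigma=0.1)
```

There is no training loop: the network simply stores `train_x` and `train_y`, which is the one-pass learning and
the memory cost described above.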


II. Related work 

In a back-propagation neural network the neurons are trained
with the gradient descent algorithm; the final weight change
is given in (1).

Figure 1. The GRNN Neural Network


III. Proposed work 

In the proposed method, instead of calculating the Euclidean
distance, we use the absolute distance between the training
samples x_k and the input x.
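
The source does not state exactly how the absolute distance enters the kernel. The Python sketch below assumes
that d_k in (3) is simply replaced by the sum of absolute coordinate differences, and that a class is read off by
thresholding Y(x) at 0.5; both choices, and all names, are assumptions made for illustration.

```python
import math

def absolute_distance(x, xk):
    """Sum of absolute coordinate differences (L1 distance)."""
    return sum(abs(a - b) for a, b in zip(x, xk))

def grnn_predict_abs(x, train_x, train_y, sigma=1.0):
    """GRNN estimate (2) with d_k in (3) replaced by the absolute distance."""
    weights = [math.exp(-absolute_distance(x, xk) / (2.0 * sigma ** 2))
               for xk in train_x]
    total = sum(weights)
    if total == 0.0:
        raise ZeroDivisionError("all kernel weights underflowed; increase sigma")
    return sum(w * y for w, y in zip(weights, train_y)) / total

def classify(x, train_x, train_y, sigma=1.0, threshold=0.5):
    """Map the continuous GRNN output to a 0/1 label (assumed threshold)."""
    return 1 if grnn_predict_abs(x, train_x, train_y, sigma) >= threshold else 0

train_x = [[0.0, 0.0], [1.0, 1.0]]
train_y = [0, 1]
label_high = classify([0.9, 0.8], train_x, train_y)
label_low = classify([0.1, 0.2], train_x, train_y)
```

Compared with the squared Euclidean distance, this variant needs no multiplications per coordinate, which is in
line with the stated aim of reducing the number of calculations.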

IV. RESULT 

To test the method in simulation we use the Wisconsin
Diagnostic Breast Cancer (WDBC) and Wisconsin Prognostic
Breast Cancer (WPBC) data sets. First we use 50% of the data
for training and 50% for testing. Then we go one step further
and use 40% for training and 60% for testing. The results are
shown in Table I and Table II.
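
A random train/test split of the kind described can be sketched in Python as follows. The function
`random_split` and the fixed seed are assumptions for illustration; the source says only that the samples are
chosen randomly.

```python
import random

def random_split(samples, train_fraction, seed=None):
    """Shuffle the samples and cut them into a training and a test part."""
    rng = random.Random(seed)  # a seed makes the split reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# 569 is the number of WDBC instances reported in the tables.
train_50, test_50 = random_split(range(569), 0.5, seed=0)
train_40, test_40 = random_split(range(569), 0.4, seed=0)
```

Every sample lands in exactly one of the two parts, so no test sample is seen during training.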

The result of GRNN is improved by changing the Euclidean
distance to the absolute distance.


TABLE I. Result of simulation for 50% train & 50% test.

Dataset                    WDBC      WPBC      Comment
Number of instances        569       198
Train percent              50%       50%
Test percent               50%       50%
Back propagation           95.08%    95.08%    Hidden = 10
Linear SVM                 78.12%    79.13%
Euclidean distance GRNN    94.16%    96.14%    σ = …
Absolute distance GRNN     100%      100%      σ = …


TABLE II. Result of simulation for 40% train & 60% test.

Dataset                    WDBC      WPBC      Comment
Number of instances        569       198
Train percent              40%       40%
Test percent               60%       60%
Back propagation           90.02%    91.49%    Hidden = 10
Linear SVM                 78.44%    79.13%
Euclidean distance GRNN    93.84%    93.84%    σ = 3
Absolute distance GRNN     100%      100%      σ = …


V. Conclusion 

In the real world we need to handle big data. If many
calculations are performed, the accuracy of the computed
result becomes low. By reducing the number of calculations we
improve the accuracy.